## 2019-05-05

- Run all system updates on DSpace Test (linode19) and reboot it
- Merge changes into the `5_x-prod` branch of CGSpace:
  - Updates to remove deprecated social media websites (Google+ and Delicious), update Twitter share intent, and add item title to Twitter and email links ([#421](https://github.com/ilri/DSpace/pull/421))
  - Add new CCAFS Phase II project tags ([#420](https://github.com/ilri/DSpace/pull/420))
  - Add item ID to REST API error logging ([#422](https://github.com/ilri/DSpace/pull/422))
- Re-deploy CGSpace from `5_x-prod` branch
- Run all system updates on CGSpace (linode18) and reboot it

2015-11

  • CGSpace went down
  • Looks like DSpace exhausted its PostgreSQL connection pool
  • Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:

    $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
     78
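  • For reference, a sketch of inspecting those idle connections directly instead of grepping (the column choice here is mine; usename, state, and query_start are standard pg_stat_activity columns in PostgreSQL 9.2+):

    $ psql -c "SELECT usename, state, query_start FROM pg_stat_activity WHERE state = 'idle';"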
  • Current database settings for DSpace are db.maxconnections = 30 and db.maxidle = 8, yet idle connections are exceeding this:

    $ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
     39
  • I restarted PostgreSQL and Tomcat and it’s back
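  • The restart itself is nothing special; a sketch, assuming the stock Ubuntu init scripts (the tomcat7 service name is an assumption):

    # service tomcat7 restart
    # service postgresql restart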

  • On a related note of why CGSpace is so slow, I decided to finally try the pgtune script to tune the postgres settings:

    # apt-get install pgtune
     # pgtune -i /etc/postgresql/9.3/main/postgresql.conf -o postgresql.conf-pgtune
     # mv /etc/postgresql/9.3/main/postgresql.conf /etc/postgresql/9.3/main/postgresql.conf.orig 
     # mv postgresql.conf-pgtune /etc/postgresql/9.3/main/postgresql.conf
  • It introduced the following new settings:

    default_statistics_target = 50
     maintenance_work_mem = 480MB
    wal_buffers = 8MB
     checkpoint_segments = 16
     shared_buffers = 1920MB
     max_connections = 80
  • Now I need to go read PostgreSQL docs about these options, and watch memory settings in munin etc

  • For what it’s worth, now the REST API should be faster (because of these PostgreSQL tweaks):

    $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
     1.474
    $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
     1.995
     $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
     1.786

    CCAFS item

  • Idle postgres connections look like this (with no change in DSpace db settings lately):

    $ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
     29
  • I restarted Tomcat and postgres…

  • Atmire commented that we should raise the JVM heap size by ~500M, so it is now -Xms3584m -Xmx3584m
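  • A sketch of where that lives, assuming the heap is passed via JAVA_OPTS in Tomcat’s defaults or setenv.sh (the other flags here are illustrative, not a record of our exact options):

    JAVA_OPTS="-Djava.awt.headless=true -Xms3584m -Xmx3584m -Dfile.encoding=UTF-8"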

  • We weren’t out of heap yet, but it’s probably fair enough that the DSpace 5 upgrade (and new Atmire modules) requires more memory so it’s ok

  • A possible side effect is that I see that the REST API is twice as fast for the request above now:

    $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
     1.368
    $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
     0.806
     $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
     0.854

    2015-12-05

  • PostgreSQL idle connections are currently:

    postgres@linode01:~$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
     28

    PostgreSQL bgwriter (year)

  • After deploying the fix to CGSpace the REST API is consistently faster:

    $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
     0.675
    $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
     0.566
     $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
     0.497

    2015-12-08

2016-01

  • Move ILRI collection 10568/12503 from 10568/27869 to 10568/27629 using the move_
  • I realized it is only necessary to clear the Cocoon cache after moving collections—rather than reindexing—as no metadata has changed, and therefore no search or browse indexes need to be updated.
  • Update GitHub wiki for documentation of maintenance tasks.

2016-02

  • I noticed we have a very interesting list of countries on CGSpace:
  • Not only are there 49,000 countries, we have some blanks (25)…
  • Also, lots of things like “COTE D`LVOIRE” and “COTE D IVOIRE”
  • First, find the metadata_field_id for the field you want from the metadatafieldregistry table:

    dspacetest=# select * from metadatafieldregistry;
  • In this case our country field is 78
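  • A sketch of narrowing that lookup instead of scanning the whole registry (the qualifier pattern is a guess, not necessarily what the field is called):

    dspacetest=# select metadata_field_id, element, qualifier from metadatafieldregistry where qualifier like '%country%';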

  • Now find all resources with type 2 (item) that have null/empty values for that field:

    dspacetest=# select resource_id from metadatavalue where resource_type_id=2 and metadata_field_id=78 and (text_value='' OR text_value IS NULL);
  • Then you can find the handle that owns it from its resource_id:

    dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '22678';
  • It’s 25 items so editing in the web UI is annoying, let’s try SQL!

    dspacetest=# delete from metadatavalue where metadata_field_id=78 and text_value='';
     DELETE 25

    2016-02-07

    @@ -184,8 +182,8 @@ DELETE 25
  • For some reason when you import an Excel file into OpenRefine it exports dates like 1949 to 1949.0 in the CSV
  • I re-import the resulting CSV and run a GREL on the date issued column: value.replace("\.0", "")
  • I need to start running DSpace in Mac OS X instead of a Linux VM
  • Install PostgreSQL from homebrew, then configure and import CGSpace database dump:

    $ postgres -D /opt/brew/var/postgres
     $ createuser --superuser postgres
    postgres=# alter user dspacetest nocreateuser;
     postgres=# \q
     $ vacuumdb dspacetest
     $ psql -U dspacetest -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest -h localhost
  • After building and running a fresh_install I symlinked the webapps into Tomcat’s webapps folder:

    $ mv /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT.orig
     $ ln -sfv ~/dspace/webapps/xmlui /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT
    $ ln -sfv ~/dspace/webapps/jspui /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/
     $ ln -sfv ~/dspace/webapps/oai /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/oai
     $ ln -sfv ~/dspace/webapps/solr /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/solr
     $ /opt/brew/Cellar/tomcat/8.0.30/bin/catalina start
  • Add CATALINA_OPTS in /opt/brew/Cellar/tomcat/8.0.30/libexec/bin/setenv.sh, as this script is sourced by the catalina startup script

  • For example:

    CATALINA_OPTS="-Djava.awt.headless=true -Xms2048m -Xmx2048m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8"
  • After verifying that the site is working, start a full index:

    $ ~/dspace/bin/dspace index-discovery -b

    2016-02-08

  • Enable HTTPS on DSpace Test using Let’s Encrypt:

    $ cd ~/src/git
     $ git clone https://github.com/letsencrypt/letsencrypt
    @@ -256,39 +250,36 @@ $ sudo service nginx stop
     $ ./letsencrypt-auto certonly --standalone -d dspacetest.cgiar.org
     $ sudo service nginx start
     $ ansible-playbook dspace.yml -l linode02 -t nginx,firewall -u aorth --ask-become-pass
  • We should install it in /opt/letsencrypt and then script the renewal, but first we have to wire up some variables and template stuff based on the script here: https://letsencrypt.org/howitworks/

  • I had to export some CIAT items that were being cleaned up on the test server and I noticed their dc.contributor.author fields have DSpace 5 authority index UUIDs…

  • To clean those up in OpenRefine I used this GREL expression: value.replace(/::\w{8}-\w{4}-\w{4}-\w{4}-\w{12}::600/,"")

  • Getting more and more hangs on DSpace Test, seemingly random but also during CSV import

  • Logs don’t always show anything right when it fails, but eventually one of these appears:

    org.dspace.discovery.SearchServiceException: Error while processing facet fields: java.lang.OutOfMemoryError: Java heap space
  • or

    Caused by: java.util.NoSuchElementException: Timeout waiting for idle object
  • Right now DSpace Test’s Tomcat heap is set to 1536m and we have quite a bit of free RAM:

    # free -m
                 total       used       free     shared    buffers     cached
     Mem:          3950       3902         48          9         37       1311
     -/+ buffers/cache:       2552       1397
     Swap:          255         57        198

    2016-02-11

  • I created a filename column based on the dc.identifier.url column using the following transform:

    value.split('/')[-1]
  • Then I wrote a tool called generate-thumbnails.py to download the PDFs and generate thumbnails for them, for example:

    $ ./generate-thumbnails.py ciat-reports.csv
     Processing 64661.pdf
     Processing 64195.pdf
     > Downloading 64195.pdf
     > Creating thumbnail for 64195.pdf

    2016-02-12

  • 265 items have dirty, URL-encoded filenames:

    $ ls | grep -c -E "%"
     265
  • I suggest that we import ~850 or so of the clean ones first, then do the rest after I can find a clean/reliable way to decode the filenames

  • This python2 snippet seems to work in the CLI, but not so well in OpenRefine:

    $ python -c "import urllib, sys; print urllib.unquote(sys.argv[1])" CIAT_COLOMBIA_000169_T%C3%A9cnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf
     CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf

    2016-02-16

  • Turns out OpenRefine has an unescape function!

    value.unescape("url")

    2016-02-17

  • Also, it seems DSpace’s SAF import tool doesn’t like importing filenames that have accents in them:

    java.io.FileNotFoundException: /usr/share/tomcat7/SimpleArchiveFormat/item_1021/CIAT_COLOMBIA_000075_Medición_de_palatabilidad_en_forrajes.pdf (No such file or directory)

    2016-02-22

  • To change Spanish accents to ASCII in OpenRefine:

    value.replace('ó','o').replace('í','i').replace('á','a').replace('é','e').replace('ñ','n')
  • But actually, the accents might not be an issue, as I can successfully import files containing Spanish accents on my Mac

  • On closer inspection, I can import files with the following names on Linux (DSpace Test):

    Bitstream: tést.pdf
     Bitstream: tést señora.pdf
     Bitstream: tést señora alimentación.pdf

    2016-02-29

  • Trying to test Atmire’s series of stats and CUA fixes from January and February, but their branch history is really messy and it’s hard to see what’s going on
  • Rebasing their branch on top of our production branch results in a broken Tomcat, so I’m going to tell them to fix their history and make a proper pull request
  • Looking at the filenames for the CIAT Reports, some have some really ugly characters, like: ' or , or = or [ or ] or ( or ) or _.pdf or ._ etc
  • It’s tricky to parse those things in some programming languages so I’d rather just get rid of the weird stuff now in OpenRefine:

    value.replace("'",'').replace('_=_','_').replace(',','').replace('[','').replace(']','').replace('(','').replace(')','').replace('_.pdf','.pdf').replace('._','_')
2016-03

  • Looking at issues with author authorities on CGSpace
  • For some reason we still have the index-lucene-update cron job active on CGSpace, but I’m pretty sure we don’t need it as of the latest few versions of Atmire’s Listings and Reports module
  • Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match environment on CGSpace server
  • Their changes on 5_x-dev branch work, but it is messy as hell with merge commits and old branch base
  • When I rebase their branch on the latest 5_x-prod I get blank white pages
  • I identified one commit that causes the issue and let them know
  • Restart DSpace Test, as it seems to have crashed after Sisay tried to import some CSV or zip or something:

    Exception in thread "Lucene Merge Thread #19" org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: No space left on device

    2016-03-08

  • More discussion on the GitHub issue here: https://github.com/ilri/DSpace/pull/182
  • Clean up Atmire CUA config (#193)
  • Help Sisay with some PostgreSQL queries to clean up the incorrect dc.contributor.corporateauthor field
  • I noticed that we have some weird values in dc.language:

    # select * from metadatavalue where metadata_field_id=37;
     metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place | authority | confidence | resource_type_id
    -------------------+-------------+-------------------+------------+-----------+-------+-----------+------------+------------------
               1942571 |       35342 |                37 | hi         |           |     1 |           |         -1 |                2
               1942468 |       35345 |                37 | hi         |           |     1 |           |         -1 |                2
               1942479 |       35337 |                37 | hi         |           |     1 |           |         -1 |                2
               1942505 |       35336 |                37 | hi         |           |     1 |           |         -1 |                2
               1942519 |       35338 |                37 | hi         |           |     1 |           |         -1 |                2
               1942535 |       35340 |                37 | hi         |           |     1 |           |         -1 |                2
               1942555 |       35341 |                37 | hi         |           |     1 |           |         -1 |                2
               1942588 |       35343 |                37 | hi         |           |     1 |           |         -1 |                2
               1942610 |       35346 |                37 | hi         |           |     1 |           |         -1 |                2
               1942624 |       35347 |                37 | hi         |           |     1 |           |         -1 |                2
               1942639 |       35339 |                37 | hi         |           |     1 |           |         -1 |                2

    2016-03-17


    Trimmed thumbnail

  • Command used:

    $ gm convert -trim -quality 82 -thumbnail x300 -flatten Descriptor\ for\ Butia_EN-2015_2021.pdf\[0\] cover.jpg

    2016-03-21


    2016-03-23

  • Abenet is having problems saving group memberships, and she gets this error: https://gist.github.com/alanorth/87281c061c2de57b773e

    Can't find method org.dspace.app.xmlui.aspect.administrative.FlowGroupUtils.processSaveGroup(org.dspace.core.Context,number,string,[Ljava.lang.String;,[Ljava.lang.String;,org.apache.cocoon.environment.wrapper.RequestWrapper). (resource://aspects/Administrative/administrative.js#967)

    2016-03-24

2016-04

  • After running DSpace for over five years I’ve never needed to look in any
  • This will save us a few gigs of backup space we’re paying for on S3
  • Also, I noticed the checker log has some errors we should pay attention to:

    java.io.FileNotFoundException: /home/cgspace.cgiar.org/assetstore/64/29/06/64290

    2016-04-05

  • Reduce Amazon S3 storage used for logs from 46 GB to 6GB by deleting a bunch of logs we don’t need!

    # s3cmd ls s3://cgspace.cgiar.org/log/ > /tmp/s3-logs.txt
     # grep checker.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
     # grep cocoon.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
     # grep handle-plugin.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
     # grep solr.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del

    2016-04-06

  • A better way to move metadata on this scale is via SQL, for example dc.type.output → dc.type (their IDs in the metadatafieldregistry are 66 and 109, respectively):

    dspacetest=# update metadatavalue set metadata_field_id=109 where metadata_field_id=66;
     UPDATE 40852

    2016-04-07

  • Testing with a few fields it seems to work well:

    $ ./migrate-fields.sh
     UPDATE metadatavalue SET metadata_field_id=109 WHERE metadata_field_id=66
    UPDATE metadatavalue SET metadata_field_id=202 WHERE metadata_field_id=72
     UPDATE 21420
     UPDATE metadatavalue SET metadata_field_id=203 WHERE metadata_field_id=76
     UPDATE 51258

    2016-04-08

  • It seems the dx.doi.org URLs are much more proper in our repository!

    dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and text_value like 'http://dx.doi.org%';
     count
    -------
      5638
    (1 row)

    dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and text_value like 'http://doi.org%';
     count
    -------
         3

    2016-04-11


    2016-04-12

  • Looking at quality of WLE data (cg.subject.iwmi) in SQL:

    dspacetest=# select text_value, count(*) from metadatavalue where metadata_field_id=217 group by text_value order by count(*) desc;
  • Listings and Reports is still not returning reliable data for dc.type

  • I think we need to ask Atmire, as their documentation isn’t too clear on the format of the filter configs

  • Alternatively, I want to see if I move all the data from dc.type.output to dc.type and then re-index, if it behaves better

  • Looking at our input-forms.xml I see we have two sets of ILRI subjects, but one has a few extra subjects

  • Remove one set of ILRI subjects and remove duplicate VALUE CHAINS from existing list (#216)

  • I decided to keep the set of subjects that had FMD and RANGELANDS added, as it appears that adding them was requested, and it might be the newer list

  • I found 226 blank metadatavalues:

    dspacetest# select * from metadatavalue where resource_type_id=2 and text_value='';
  • I think we should delete them and do a full re-index:

    dspacetest=# delete from metadatavalue where resource_type_id=2 and text_value='';
     DELETE 226

    2016-04-14

  • cg.livestock.agegroup: 9 items, in ILRI collections
  • cg.livestock.function: 20 items, mostly in EADD
  • Test metadata migration on local instance again:

    $ ./migrate-fields.sh
     UPDATE metadatavalue SET metadata_field_id=109 WHERE metadata_field_id=66
    UPDATE 3872
     UPDATE metadatavalue SET metadata_field_id=217 WHERE metadata_field_id=108
     UPDATE 46075
     $ JAVA_OPTS="-Xms512m -Xmx512m -Dfile.encoding=UTF-8" ~/dspace/bin/dspace index-discovery -bf
  • CGSpace was down but I’m not sure why, this was in catalina.out:

    Apr 18, 2016 7:32:26 PM com.sun.jersey.spi.container.ContainerResponse logException
     SEVERE: Mapped exception to response: 500 (Internal Server Error)
     javax.ws.rs.WebApplicationException
            at org.dspace.rest.Resource.processFinally(Resource.java:163)
            at org.dspace.rest.HandleResource.getObject(HandleResource.java:81)
            at sun.reflect.GeneratedMethodAccessor198.invoke(Unknown Source)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.lang.reflect.Method.invoke(Method.java:606)
            at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
            at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
            at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
            at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
            at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
            at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
            at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
            at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
            at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1511)
            at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1442)
            at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1391)
            at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1381)
            at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)
     ...

    2016-04-19

  • Get handles for items that are using a given metadata field, ie dc.Species.animal (105):

    # select handle from item, handle where handle.resource_id = item.item_id AND item.item_id in (select resource_id from metadatavalue where resource_type_id=2 and metadata_field_id=105);
       handle
    -------------
     10568/10298
     10568/16413
     10568/16774
     10568/34487
  • Delete metadata values for dc.GRP and dc.icsubject.icrafsubject:

    # delete from metadatavalue where resource_type_id=2 and metadata_field_id=96;
     # delete from metadatavalue where resource_type_id=2 and metadata_field_id=83;
  • They are old ICRAF fields and we haven’t used them since 2011 or so

  • Also delete them from the metadata registry
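  • A sketch of what that registry cleanup would look like in SQL, assuming the same field IDs as above and that their metadatavalue rows are already deleted (I would double check before running it):

    # delete from metadatafieldregistry where metadata_field_id in (83, 96);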

  • CGSpace went down again, dspace.log had this:

    2016-04-19 15:02:17,025 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
     org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
  • I restarted Tomcat and PostgreSQL and now it’s back up

  • I bet this is the same crash as yesterday, but I only saw the errors in catalina.out

  • Looks to be related to this, from dspace.log:

    2016-04-19 15:16:34,670 ERROR org.dspace.rest.Resource @ Something get wrong. Aborting context in finally statement.
  • We have 18,000 of these errors right now…
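  • A sketch of how I would count them, the same grep I use on the following days (the exact log file name is an assumption):

    $ grep -c "Aborting context in finally statement" dspace.log.2016-04-19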

  • Delete a few more old metadata values: dc.Species.animal, dc.type.journal, and dc.publicationcategory:

    # delete from metadatavalue where resource_type_id=2 and metadata_field_id=105;
     # delete from metadatavalue where resource_type_id=2 and metadata_field_id=85;
     # delete from metadatavalue where resource_type_id=2 and metadata_field_id=95;

    2016-04-20

  • Field migration went well:

    $ ./migrate-fields.sh
     UPDATE metadatavalue SET metadata_field_id=109 WHERE metadata_field_id=66
    UPDATE metadatavalue SET metadata_field_id=215 WHERE metadata_field_id=106
     UPDATE 3872
     UPDATE metadatavalue SET metadata_field_id=217 WHERE metadata_field_id=108
     UPDATE 46075
  • Also, I migrated CGSpace to using the PGDG PostgreSQL repo as the infrastructure playbooks had been using it for a while and it seemed to be working well

  • Basically, this gives us the ability to use the latest upstream stable 9.3.x release (currently 9.3.12)
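  • Roughly what the playbooks do to set that up, as a sketch (assuming Ubuntu trusty; the repo and signing key URLs are the standard upstream PGDG ones):

    # wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | apt-key add -
    # echo "deb http://apt.postgresql.org/pub/repos/apt/ trusty-pgdg main" > /etc/apt/sources.list.d/pgdg.list
    # apt-get update && apt-get install postgresql-9.3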

  • Looking into the REST API errors again, it looks like these started appearing a few days ago in the tens of thousands:

    $ grep -c "Aborting context in finally statement" dspace.log.2016-04-20
     21252

    2016-04-21

  • I think there must be something with this REST stuff:

    # grep -c "Aborting context in finally statement" dspace.log.2016-04-*
     dspace.log.2016-04-01:0
    dspace.log.2016-04-24:28775
     dspace.log.2016-04-25:28626
     dspace.log.2016-04-26:28655
     dspace.log.2016-04-27:7271

    2016-04-28


    2016-04-30

  • Logs for today and yesterday have zero references to this REST error, so I’m going to open back up the REST API but log all requests

    location /rest {
     	access_log /var/log/nginx/rest.log;
     	proxy_pass http://127.0.0.1:8443;
     }
2016-05

  • Since yesterday there have been 10,000 REST errors and the site has been unstable again
  • I have blocked access to the API now
  • There are 3,000 IPs accessing the REST API in a 24-hour period!

    # awk '{print $1}' /var/log/nginx/rest.log  | uniq | wc -l
     3168
  • 100% of the requests coming from Ethiopia are like this and result in an HTTP 500:

    GET /rest/handle/10568/NaN?expand=parentCommunityList,metadata HTTP/1.1

    2016-05-03

  • Hmm, also disk space is full
  • I decided to blow away the solr indexes, since they are 50GB and we don’t really need all the Atmire stuff there right now
  • I will re-generate the Discovery indexes after re-deploying
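  • A sketch of checking disk space and re-generating Discovery afterwards, using the same command as elsewhere in these notes (the DSpace path is assumed):

    # df -h
    $ /home/dspacetest.cgiar.org/bin/dspace index-discovery -b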
  • Testing renew-letsencrypt.sh script for nginx

    #!/usr/bin/env bash
     
    LE_RESULT=$?
     $SERVICE_BIN nginx start
     
     if [[ "$LE_RESULT" != 0 ]]; then
        echo 'Automated renewal failed:'

        cat /var/log/letsencrypt/renew.log

        exit 1
     fi

    2016-05-10

  • There were a handful of conflicts that I didn’t understand

  • After completing the rebase I tried to build with the module versions Atmire had indicated as being 5.5 ready but I got this error:

    [ERROR] Failed to execute goal on project additions: Could not resolve dependencies for project org.dspace.modules:additions:jar:5.5: Could not find artifact com.atmire:atmire-metadata-quality-api:jar:5.5-2.10.1-0 in sonatype-releases (https://oss.sonatype.org/content/repositories/releases/) -> [Help 1]

    2016-05-12

  • Our dc.contributor.affiliation and dc.contributor.corporate could both map to dc.contributor and possibly dc.contributor.center depending on if it’s a CG center or not
  • dc.title.jtitle could either map to dc.publisher or dc.source depending on how you read things
  • Found ~200 messed up CIAT values in dc.publisher:

    # select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=39 and text_value similar to "%  %";

    2016-05-13

  • In OpenRefine I created a new filename column based on the thumbnail column with the following GREL:

    if(cells['thumbnails'].value.contains('hqdefault'), cells['thumbnails'].value.split('/')[-2] + '.jpg', cells['thumbnails'].value.split('/')[-1])

    2016-05-19

  • More quality control on filename field of CCAFS records to make processing in shell and SAFBuilder more reliable:

    value.replace('_','').replace('-','')
  • This ought to catch all the CPWF values (there don’t appear to be and SG* values):

    # select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=75 and (text_value like 'PN%' or text_value like 'PHASE%' or text_value = 'CBA' or text_value = 'IA');

    2016-05-20

  • For SAFBuilder we need to modify filename column to have the thumbnail bundle:

    value + "__bundle:THUMBNAIL"
  • Also, I fixed some weird characters using OpenRefine’s transform with the following GREL:

    value.replace(/\u0081/,'')

    2016-05-23


    2016-05-30

  • Export CCAFS video and image records from DSpace Test using the migrate option (-m):

    $ mkdir ~/ccafs-images
     $ /home/dspacetest.cgiar.org/bin/dspace export -t COLLECTION -i 10568/79355 -d ~/ccafs-images -n 0 -m
  • And then import to CGSpace:

    $ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/70974 --source /tmp/ccafs-images --mapfile=/tmp/ccafs-images-may30.map &> /tmp/ccafs-images-may30.log
  • But now we have double authors for “CGIAR Research Program on Climate Change, Agriculture and Food Security” in the authority

  • I’m trying to do a Discovery index before messing with the authority index

  • Looks like we are missing the index-authority cron job, so who knows what’s up with our authority index
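  • A sketch of the kind of crontab entry that seems to be missing (the path, schedule, and user are assumptions):

    0 0 * * * /home/cgspace.cgiar.org/bin/dspace index-authority > /dev/null 2>&1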

  • Run system updates on DSpace Test, re-deploy code, and reboot the server

  • Clean up and import ~200 CTA records to CGSpace via CSV like:

    $ export JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8"
     $ /home/cgspace.cgiar.org/bin/dspace metadata-import -e aorth@mjanja.ch -f ~/CTA-May30/CTA-42229.csv &> ~/CTA-May30/CTA-42229.log
  • Discovery indexing took a few hours for some reason, and after that I started the index-authority script

    $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" /home/cgspace.cgiar.org/bin/dspace index-authority

    2016-05-31

  • I am running it again with a timer to see:

    $ time /home/cgspace.cgiar.org/bin/dspace index-authority
     Retrieving all data
    All done !
     real    37m26.538s
     user    2m24.627s
     sys     0m20.540s
2016-06

  • This is their publications set: http://ebrary.ifpri.org/oai/oai.php?verb=ListRec
  • You can see the others by using the OAI ListSets verb: http://ebrary.ifpri.org/oai/oai.php?verb=ListSets
  • Working on second phase of metadata migration, looks like this will work for moving CPWF-specific data in dc.identifier.fund to cg.identifier.cpwfproject and then the rest to dc.description.sponsorship
  • Seems that the Browse configuration in dspace.cfg can’t handle the ‘-’ in the field name:

    webui.browse.index.12 = subregion:metadata:cg.coverage.admin-unit:text

    2016-06-03

  • The top two authors are:

    CGIAR Research Program on Climate Change, Agriculture and Food Security::acd00765-02f1-4b5b-92fa-bfa3877229ce::500
     CGIAR Research Program on Climate Change, Agriculture and Food Security::acd00765-02f1-4b5b-92fa-bfa3877229ce::600
  • So the only difference is the “confidence”

  • Ok, well THAT is interesting:

    dspacetest=# select text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like '%Orth, %';
     text_value |              authority               | confidence
    ------------+--------------------------------------+------------
     Orth, A.   | ab606e3a-2b04-4c7d-9423-14beccf54257 |         -1
     Orth, A.   | ab606e3a-2b04-4c7d-9423-14beccf54257 |         -1
     Orth, A.   | ab606e3a-2b04-4c7d-9423-14beccf54257 |         -1
     Orth, Alan |                                      |         -1
     Orth, Alan |                                      |         -1
     Orth, Alan |                                      |         -1
     Orth, Alan |                                      |         -1
     Orth, A.   | 05c2c622-d252-4efb-b9ed-95a07d3adf11 |         -1
     Orth, A.   | 05c2c622-d252-4efb-b9ed-95a07d3adf11 |         -1
     Orth, A.   | ab606e3a-2b04-4c7d-9423-14beccf54257 |         -1
     Orth, A.   | ab606e3a-2b04-4c7d-9423-14beccf54257 |         -1
     Orth, Alan | ad281dbf-ef81-4007-96c3-a7f5d2eaa6d9 |        600
     Orth, Alan | ad281dbf-ef81-4007-96c3-a7f5d2eaa6d9 |        600
     (13 rows)
  • And now an actually relevant example:

    dspacetest=# select count(*) from metadatavalue where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security' and confidence = 500;
     count
    -------
       707
     (1 row)
     
     dspacetest=# select count(*) from metadatavalue where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security' and confidence != 500;
     count
    -------
       253
     (1 row)
  • Trying something experimental:

    dspacetest=# update metadatavalue set confidence=500 where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security';
     UPDATE 960
  • And then re-indexing authority and Discovery…?
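  • A sketch of those two re-index steps, using the same commands as elsewhere in these notes (paths and JVM options assumed):

    $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" ~/dspace/bin/dspace index-authority
    $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" ~/dspace/bin/dspace index-discovery -b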

  • After Discovery reindex the CCAFS authors are all together in the Authors sidebar facet

  • The docs for the ORCiD and Authority stuff for DSpace 5 mention changing the browse indexes to use the Authority as well:

    webui.browse.index.2 = author:metadataAuthority:dc.contributor.author:authority

    2016-06-04


    2016-06-07

  • Figured out how to export a list of the unique values from a metadata field ordered by count:

    dspacetest=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=29 group by text_value order by count desc) to /tmp/sponsorship.csv with csv;

    2016-07-13

  • CGSpace crashed late at night and the DSpace logs were showing:

    2016-07-18 20:26:30,941 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error - 
     org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
     ...
  • I suspect it’s someone hitting REST too much:

    # awk '{print $1}' /var/log/nginx/rest.log  | sort -n | uniq -c | sort -h | tail -n 3
        710 66.249.78.38
       1781 181.118.144.29
      24904 70.32.99.142
    # log rest requests
    location /rest {
        access_log /var/log/nginx/rest.log;
        proxy_pass http://127.0.0.1:8443;
        deny 70.32.99.142;
    }

    2016-07-21

  • Trying these on DSpace Test after a discussion by Daniel Scharon on the dspace-tech mailing list:

    index.authority.ignore-prefered.dc.contributor.author=true
     index.authority.ignore-variants.dc.contributor.author=false
  • After reindexing I don’t see any change in Discovery’s display of authors, and still have entries like:

    Grace, D. (464)
     Grace, D. (62)
  • I asked for clarification of the following options on the DSpace mailing list:

    index.authority.ignore
     index.authority.ignore-prefered
     index.authority.ignore-variants
  • In the mean time, I will try these on DSpace Test (plus a reindex):

    index.authority.ignore=true
     index.authority.ignore-prefered=true
     index.authority.ignore-variants=true
  • Enabled usage of X-Forwarded-For in DSpace admin control panel (#255)

  • It was misconfigured and disabled, but already working for some reason sigh
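  • For reference, a sketch of checking the relevant dspace.cfg property (the path is an assumption; useProxies has to be true for X-Forwarded-For to be honored):

    $ grep useProxies /home/cgspace.cgiar.org/config/dspace.cfg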

  • … no luck. Trying with just:

    index.authority.ignore=true

    2016-07-25

  • Trying a few more settings (plus reindex) for Discovery on DSpace Test:

    index.authority.ignore-prefered.dc.contributor.author=true
     index.authority.ignore-variants=true

    About page

  • The DSpace source code mentions the configuration key discovery.index.authority.ignore-prefered.* (with prefix of discovery, despite the docs saying otherwise), so I’m trying the following on DSpace Test:

    discovery.index.authority.ignore-prefered.dc.contributor.author=true
     discovery.index.authority.ignore-variants=true

    2016-07-31

2016-08

  • Play with upgrading Mirage 2 dependencies in bower.json because most are several
  • Bootstrap is at 3.3.0 but upstream is at 3.3.7, and upgrading to anything beyond 3.3.1 breaks glyphicons and probably more
  • bower stuff is a dead end, waste of time, too many issues
  • Anything after Bootstrap 3.3.1 makes glyphicons disappear (HTTP 404 trying to access from incorrect path of fonts)
  • Start working on DSpace 5.1 → 5.5 port:

    $ git checkout -b 55new 5_x-prod
     $ git reset --hard ilri/5_x-prod
     $ git rebase -i dspace-5.5

    2016-08-06

  • Ooh, and vanilla DSpace 5.5 works on Tomcat 8 with Java 8!
  • Some notes about setting up Tomcat 8, since it’s new on this machine…
  • Install latest Oracle Java 8 JDK
  • Create setenv.sh in Tomcat 8 libexec/bin directory:

    CATALINA_OPTS="-Djava.awt.headless=true -Xms3072m -Xmx3072m -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -Dfile.encoding=UTF-8"
     CATALINA_OPTS="$CATALINA_OPTS -Djava.library.path=/opt/brew/Cellar/tomcat-native/1.2.8/lib"
     
     JRE_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_102.jdk/Contents/Home
  • Edit Tomcat 8 server.xml to add regular HTTP listener for solr

  • Symlink webapps:

    $ rm -rf /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/ROOT
     $ ln -sv ~/dspace/webapps/xmlui /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/ROOT
    $ ln -sv ~/dspace/webapps/oai /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/oai
     $ ln -sv ~/dspace/webapps/jspui /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/jspui
     $ ln -sv ~/dspace/webapps/rest /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/rest
     $ ln -sv ~/dspace/webapps/solr /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/solr

    2016-08-09

  • Also need to fix existing records using the incorrect form in the database:

    dspace=# update metadatavalue set text_value='CONGO, DR' where resource_type_id=2 and metadata_field_id=228 and text_value='CONGO,DR';

    2016-08-21


    2016-08-22

  • Database migrations are fine on DSpace 5.1:

    $ ~/dspace/bin/dspace database info
     
    Database Driver: PostgreSQL Native Driver version PostgreSQL 9.1 JDBC4 (build 90
     | 5.1.2015.12.03 | Atmire CUA 4 migration     | 2016-03-21 17:10:41 | Success |
     | 5.1.2015.12.03 | Atmire MQM migration       | 2016-03-21 17:10:42 | Success |
     +----------------+----------------------------+---------------------+---------+

    2016-08-23

  • They said I should delete the Atmire migrations

    dspacetest=# delete from schema_version where description =  'Atmire CUA 4 migration' and version='5.1.2015.12.03.2';
     dspacetest=# delete from schema_version where description =  'Atmire MQM migration' and version='5.1.2015.12.03.3';
  • After that DSpace starts up, but XMLUI now has unrelated issues that I need to solve!

    org.apache.avalon.framework.configuration.ConfigurationException: Type 'ThemeResourceReader' does not exist for 'map:read' at jndi:/localhost/themes/0_CGIAR/sitemap.xmap:136:77
     context:/jndi:/localhost/themes/0_CGIAR/sitemap.xmap - 136:77
    -
    +
  • -

    2016-08-24

    + +
  • SQL to get all journal titles from dc.source (55), since it’s apparently used for internal DSpace filename shit, but we moved all our journal titles there a few months ago:

    dspacetest=# select distinct text_value from metadatavalue where metadata_field_id=55 and text_value !~ '.*(\.pdf|\.png|\.PDF|\.Pdf|\.JPEG|\.jpg|\.JPG|\.jpeg|\.xls|\.rtf|\.docx?|\.potx|\.dotx|\.eqa|\.tiff|\.mp4|\.mp3|\.gif|\.zip|\.txt|\.pptx|\.indd|\.PNG|\.bmp|\.exe|org\.dspace\.app\.mediafilter).*';
    -
    +
  • +

    2016-08-25

    +
  • Atmire suggested adding a missing bean to dspace/config/spring/api/atmire-cua.xml but it doesn’t help:

    ...
     Error creating bean with name 'MetadataStorageInfoService'
     ...
  • Atmire sent an updated version of dspace/config/spring/api/atmire-cua.xml and now XMLUI starts but gives a null pointer exception:

    Java stacktrace: java.lang.NullPointerException
        at org.dspace.app.xmlui.aspect.statistics.Navigation.addOptions(Navigation.java:129)
        at org.dspace.app.xmlui.wing.AbstractWingTransformer.startElement(AbstractWingTransformer.java:228)
        at sun.reflect.GeneratedMethodAccessor126.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
        at com.sun.proxy.$Proxy103.startElement(Unknown Source)
        at org.apache.cocoon.environment.internal.EnvironmentChanger.startElement(EnvironmentStack.java:140)
        at org.apache.cocoon.environment.internal.EnvironmentChanger.startElement(EnvironmentStack.java:140)
        at org.apache.cocoon.xml.AbstractXMLPipe.startElement(AbstractXMLPipe.java:94)
     ...
  • Import the 47 CCAFS records to CGSpace, creating the SimpleArchiveFormat bundles and importing like:

    $ ./safbuilder.sh -c /tmp/Thumbnails\ to\ Upload\ to\ CGSpace/3546.csv
     $ JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m" /home/cgspace.cgiar.org/bin/dspace import -a -e aorth@mjanja.ch -c 10568/3546 -s /tmp/Thumbnails\ to\ Upload\ to\ CGSpace/SimpleArchiveFormat -m 3546.map

    2016-08-26

  • The dspace log had this:

    2016-08-26 20:48:05,040 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
    org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object

    2016-08-27

2016-09
  • Discuss helping CCAFS with some batch tagging of ORCID IDs for their authors
  • Discuss how the migration of CGIAR’s Active Directory to a flat structure will break our LDAP groups in DSpace
  • We had been using DC=ILRI to determine whether a user was ILRI or not
  • It looks like we might be able to use OUs now, instead of DCs:

    $ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=org" -D "admigration1@cgiarad.org" -W "(sAMAccountName=admigration1)"
  • User who has been migrated to the root vs user still in the hierarchical structure:

    distinguishedName: CN=Last\, First (ILRI),OU=ILRI Kenya Employees,OU=ILRI Kenya,OU=ILRIHUB,DC=CGIARAD,DC=ORG
     distinguishedName: CN=Last\, First (ILRI),OU=ILRI Ethiopia Employees,OU=ILRI Ethiopia,DC=ILRI,DC=CGIARAD,DC=ORG
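  • If so, a quick way to tell ILRI users apart might be to check for the ILRIHUB OU in the DN instead of the DC (a rough sketch based on the DNs above, not something I have tested; the account name is a placeholder):

    $ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=org" -D "admigration1@cgiarad.org" -W "(sAMAccountName=someuser)" distinguishedName | grep -i 'OU=ILRIHUB'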

    DSpace groups based on LDAP DN

  • Notes for local PostgreSQL database recreation from production snapshot:

    $ dropdb dspacetest
     $ createdb -O dspacetest --encoding=UNICODE dspacetest
    @@ -151,96 +151,83 @@ $ pg_restore -O -U dspacetest -d dspacetest ~/Downloads/cgspace_2016-09-01.backu
     $ psql dspacetest -c 'alter user dspacetest nocreateuser;'
     $ psql -U dspacetest -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest -h localhost
     $ vacuumdb dspacetest
  • Some names that I thought I fixed in July seem not to be:

    dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Poole, %';
    -      text_value       |              authority               | confidence
    +  text_value       |              authority               | confidence
     -----------------------+--------------------------------------+------------
    - Poole, Elizabeth Jane | b6efa27f-8829-4b92-80fe-bc63e03e3ccb |        600
    - Poole, Elizabeth Jane | 41628f42-fc38-4b38-b473-93aec9196326 |        600
    - Poole, Elizabeth Jane | 83b82da0-f652-4ebc-babc-591af1697919 |        600
    - Poole, Elizabeth Jane | c3a22456-8d6a-41f9-bba0-de51ef564d45 |        600
    - Poole, E.J.           | c3a22456-8d6a-41f9-bba0-de51ef564d45 |        600
    - Poole, E.J.           | 0fbd91b9-1b71-4504-8828-e26885bf8b84 |        600
    +Poole, Elizabeth Jane | b6efa27f-8829-4b92-80fe-bc63e03e3ccb |        600
    +Poole, Elizabeth Jane | 41628f42-fc38-4b38-b473-93aec9196326 |        600
    +Poole, Elizabeth Jane | 83b82da0-f652-4ebc-babc-591af1697919 |        600
    +Poole, Elizabeth Jane | c3a22456-8d6a-41f9-bba0-de51ef564d45 |        600
    +Poole, E.J.           | c3a22456-8d6a-41f9-bba0-de51ef564d45 |        600
    +Poole, E.J.           | 0fbd91b9-1b71-4504-8828-e26885bf8b84 |        600
     (6 rows)
  • At least a few of these actually have the correct ORCID, but I will unify the authority to be c3a22456-8d6a-41f9-bba0-de51ef564d45

    dspacetest=# update metadatavalue set authority='c3a22456-8d6a-41f9-bba0-de51ef564d45', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Poole, %';
     UPDATE 69
  • And for Peter Ballantyne:

    dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Ballantyne, %';
    -    text_value     |              authority               | confidence
    +text_value     |              authority               | confidence
     -------------------+--------------------------------------+------------
    - Ballantyne, Peter | 2dcbcc7b-47b0-4fd7-bef9-39d554494081 |        600
    - Ballantyne, Peter | 4f04ca06-9a76-4206-bd9c-917ca75d278e |        600
    - Ballantyne, P.G.  | 4f04ca06-9a76-4206-bd9c-917ca75d278e |        600
    - Ballantyne, Peter | ba5f205b-b78b-43e5-8e80-0c9a1e1ad2ca |        600
    - Ballantyne, Peter | 20f21160-414c-4ecf-89ca-5f2cb64e75c1 |        600
    +Ballantyne, Peter | 2dcbcc7b-47b0-4fd7-bef9-39d554494081 |        600
    +Ballantyne, Peter | 4f04ca06-9a76-4206-bd9c-917ca75d278e |        600
    +Ballantyne, P.G.  | 4f04ca06-9a76-4206-bd9c-917ca75d278e |        600
    +Ballantyne, Peter | ba5f205b-b78b-43e5-8e80-0c9a1e1ad2ca |        600
    +Ballantyne, Peter | 20f21160-414c-4ecf-89ca-5f2cb64e75c1 |        600
     (5 rows)
  • Again, a few have the correct ORCID, but there should only be one authority…

    dspacetest=# update metadatavalue set authority='4f04ca06-9a76-4206-bd9c-917ca75d278e', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Ballantyne, %';
     UPDATE 58
  • And for me:

    dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Orth, A%';
    - text_value |              authority               | confidence
    +text_value |              authority               | confidence
     ------------+--------------------------------------+------------
    - Orth, Alan | 4884def0-4d7e-4256-9dd4-018cd60a5871 |        600
    - Orth, A.   | 4884def0-4d7e-4256-9dd4-018cd60a5871 |        600
    - Orth, A.   | 1a1943a0-3f87-402f-9afe-e52fb46a513e |        600
    +Orth, Alan | 4884def0-4d7e-4256-9dd4-018cd60a5871 |        600
    +Orth, A.   | 4884def0-4d7e-4256-9dd4-018cd60a5871 |        600
    +Orth, A.   | 1a1943a0-3f87-402f-9afe-e52fb46a513e |        600
     (3 rows)
     dspacetest=# update metadatavalue set authority='1a1943a0-3f87-402f-9afe-e52fb46a513e', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Orth, %';
     UPDATE 11
  • And for CCAFS author Bruce Campbell that I had discussed with CCAFS earlier this week:

    dspacetest=# update metadatavalue set authority='0e414b4c-4671-4a23-b570-6077aca647d8', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Campbell, B%';
     UPDATE 166
     dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Campbell, B%';
    -       text_value       |              authority               | confidence
    +   text_value       |              authority               | confidence
     ------------------------+--------------------------------------+------------
    - Campbell, Bruce        | 0e414b4c-4671-4a23-b570-6077aca647d8 |        600
    - Campbell, Bruce Morgan | 0e414b4c-4671-4a23-b570-6077aca647d8 |        600
    - Campbell, B.           | 0e414b4c-4671-4a23-b570-6077aca647d8 |        600
    - Campbell, B.M.         | 0e414b4c-4671-4a23-b570-6077aca647d8 |        600
    +Campbell, Bruce        | 0e414b4c-4671-4a23-b570-6077aca647d8 |        600
    +Campbell, Bruce Morgan | 0e414b4c-4671-4a23-b570-6077aca647d8 |        600
    +Campbell, B.           | 0e414b4c-4671-4a23-b570-6077aca647d8 |        600
    +Campbell, B.M.         | 0e414b4c-4671-4a23-b570-6077aca647d8 |        600
     (4 rows)

    2016-09-05

  • After one week of logging TLS connections on CGSpace:

    # zgrep "DES-CBC3" /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | wc -l
     217
    @@ -249,18 +236,16 @@ dspacetest=# select distinct text_value, authority, confidence from metadatavalu
     # zgrep "DES-CBC3" /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | awk '{print $6}' | sort | uniq
     TLSv1/DES-CBC3-SHA
     TLSv1/EDH-RSA-DES-CBC3-SHA
  • So this represents 0.02% of 1.16M connections over a one-week period

  • Transforming some filenames in OpenRefine so they can have a useful description for SAFBuilder:

    value + "__description:" + cells["dc.type"].value

    2016-09-06

    @@ -283,28 +268,31 @@ TLSv1/EDH-RSA-DES-CBC3-SHA
  • See: http://www.fileformat.info/info/unicode/char/e1/index.htm
  • See: http://demo.icu-project.org/icu-bin/nbrowser?t=%C3%A1&s=&uv=0
  • If I unzip the original zip from CIAT on Windows, re-zip it with 7zip on Windows, and then unzip it on Linux directly, the file names seem to be proper UTF-8
  • We should definitely clean filenames so they don’t use characters that are tricky to process in CSV and shell scripts, like: ,, ', and "

    value.replace("'","").replace(",","").replace('"','')
  • Import CIAT Gender Network records to CGSpace, first creating the SAF bundles as my user, then importing as the tomcat7 user, and deleting the bundle, for each collection’s items:

    $ ./safbuilder.sh -c /home/aorth/ciat-gender-2016-09-06/66601.csv
     $ JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m" /home/cgspace.cgiar.org/bin/dspace import -a -e aorth@mjanja.ch -c 10568/66601 -s /home/aorth/ciat-gender-2016-09-06/SimpleArchiveFormat -m 66601.map
     $ rm -rf ~/ciat-gender-2016-09-06/SimpleArchiveFormat/

    2016-09-07

    @@ -313,132 +301,117 @@ $ rm -rf ~/ciat-gender-2016-09-06/SimpleArchiveFormat/
  • Reading about PostgreSQL maintenance and it seems manual vacuuming is only for certain workloads, such as heavy update/write loads
  • I suggest we disable our nightly manual vacuum task, as we’re a mostly read workload, and I’d rather stick as close to the documentation as possible since we haven’t done any testing/observation of PostgreSQL
  • See: https://www.postgresql.org/docs/9.3/static/routine-vacuuming.html
  • CGSpace went down and the error seems to be the same as always (lately):

    2016-09-07 11:39:23,162 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
     org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
     ...

    2016-09-13

  • CGSpace crashed twice today, errors from catalina.out:

    org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
    -        at org.apache.commons.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:114)
    + at org.apache.commons.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:114)

    2016-09-14

  • CGSpace crashed again, errors from catalina.out:

    org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
    -        at org.apache.commons.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:114)
    + at org.apache.commons.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:114)
  • I restarted Tomcat and it was ok again

  • CGSpace crashed a few hours later, errors from catalina.out:

    Exception in thread "http-bio-127.0.0.1-8081-exec-25" java.lang.OutOfMemoryError: Java heap space
    -        at java.lang.StringCoding.decode(StringCoding.java:215)
    + at java.lang.StringCoding.decode(StringCoding.java:215)
  • We haven’t seen that in quite a while…

  • Indeed, in a month of logs it only occurs 15 times:

    # grep -rsI "OutOfMemoryError" /var/log/tomcat7/catalina.* | wc -l
     15
  • I also see a bunch of errors from dspace.log:

    2016-09-14 12:23:07,981 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
     org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
  • Looking at REST requests, it seems there is one IP hitting us nonstop:

    # awk '{print $1}' /var/log/nginx/rest.log  | sort -n | uniq -c | sort -h | tail -n 3
    -    820 50.87.54.15
    -  12872 70.32.99.142
    -  25744 70.32.83.92
    +820 50.87.54.15
    +12872 70.32.99.142
    +25744 70.32.83.92
     # awk '{print $1}' /var/log/nginx/rest.log.1  | sort -n | uniq -c | sort -h | tail -n 3
    -   7966 181.118.144.29
    -  54706 70.32.99.142
    - 109412 70.32.83.92
    +7966 181.118.144.29
    +54706 70.32.99.142
    +109412 70.32.83.92
  • Those are the same IPs that were hitting us heavily in July, 2016 as well…

  • I think the stability issues are definitely from REST

  • Crashed AGAIN, errors from dspace.log:

    2016-09-14 14:31:43,069 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
     org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
  • And more heap space errors:

    # grep -rsI "OutOfMemoryError" /var/log/tomcat7/catalina.* | wc -l
     19
  • There are no more rest requests since the last crash, so maybe there are other things causing this.

  • Hmm, I noticed a shitload of IPs from 180.76.0.0/16 are connecting to both CGSpace and DSpace Test (58 unique IPs concurrently!)
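  • A rough way to count the unique IPs from that range in today’s DSpace log (my own sketch, not what the activity control panel reports):

    # grep -o -E 'ip_addr=180\.76\.[0-9.]+' /home/cgspace.cgiar.org/log/dspace.log.2016-09-14 | sort -u | wc -l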

  • They seem to be coming from Baidu, and so far today alone they account for about 1/6 of all connections:

    # grep -c ip_addr= /home/cgspace.cgiar.org/log/dspace.log.2016-09-14
     29084
     # grep -c ip_addr=180.76.15 /home/cgspace.cgiar.org/log/dspace.log.2016-09-14
     5192
  • Other recent days are the same… hmmm.

  • From the activity control panel I can see 58 unique IPs hitting the site concurrently, which has GOT to hurt our stability

  • A list of all 2000 unique IPs from CGSpace logs today:

    # grep ip_addr= /home/cgspace.cgiar.org/log/dspace.log.2016-09-11 | awk -F: '{print $5}' | sort -n | uniq -c | sort -h | tail -n 100
  • Looking at the top 20 IPs or so, most are Yahoo, MSN, Google, Baidu, TurnitIn (iParadigm), etc… do we have any real users?
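  • To double check, a reverse DNS one-liner like this (a rough sketch building on the pipeline above) should show which of those top IPs resolve to crawler hostnames:

    # grep ip_addr= /home/cgspace.cgiar.org/log/dspace.log.2016-09-11 | awk -F: '{print $5}' | sed 's/ip_addr=//' | sort | uniq -c | sort -h | tail -n 20 | awk '{print $2}' | while read ip; do echo -n "$ip "; dig +short -x "$ip"; done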

  • Generate a list of all author affiliations for Peter Ballantyne to go through, make corrections, and create a lookup list from:

    dspacetest=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=211 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
  • Looking into the Catalina logs again around the time of the first crash, I see:

    Wed Sep 14 09:47:27 UTC 2016 | Query:id: 78581 AND type:2
     Wed Sep 14 09:47:28 UTC 2016 | Updating : 6/6 docs.
    @@ -446,12 +419,11 @@ Commit
     Commit done
     dn:CN=Haman\, Magdalena  (CIAT-CCAFS),OU=Standard,OU=Users,OU=HQ,OU=CIATHUB,dc=cgiarad,dc=org
     Exception in thread "http-bio-127.0.0.1-8081-exec-193" java.lang.OutOfMemoryError: Java heap space
  • And after that I see a bunch of “pool error Timeout waiting for idle object”

  • Later, near the time of the next crash I see:

    dn:CN=Haman\, Magdalena  (CIAT-CCAFS),OU=Standard,OU=Users,OU=HQ,OU=CIATHUB,dc=cgiarad,dc=org
     Wed Sep 14 11:29:55 UTC 2016 | Query:id: 79078 AND type:2
    @@ -462,27 +434,24 @@ Sep 14, 2016 11:32:22 AM com.sun.jersey.server.wadl.generators.WadlGeneratorJAXB
     SEVERE: Failed to generate the schema for the JAX-B elements
     com.sun.xml.bind.v2.runtime.IllegalAnnotationsException: 2 counts of IllegalAnnotationExceptions
     java.util.Map is an interface, and JAXB can't handle interfaces.
    -        this problem is related to the following location:
    -                at java.util.Map
    -                at public java.util.Map com.atmire.dspace.rest.common.Statlet.getRender()
    -                at com.atmire.dspace.rest.common.Statlet
    +    this problem is related to the following location:
    +            at java.util.Map
    +            at public java.util.Map com.atmire.dspace.rest.common.Statlet.getRender()
    +            at com.atmire.dspace.rest.common.Statlet
     java.util.Map does not have a no-arg default constructor.
    -        this problem is related to the following location:
    -                at java.util.Map
    -                at public java.util.Map com.atmire.dspace.rest.common.Statlet.getRender()
    -                at com.atmire.dspace.rest.common.Statlet
    +    this problem is related to the following location:
    +            at java.util.Map
    +            at public java.util.Map com.atmire.dspace.rest.common.Statlet.getRender()
    +            at com.atmire.dspace.rest.common.Statlet
  • Then 20 minutes later another outOfMemoryError:

    Exception in thread "http-bio-127.0.0.1-8081-exec-25" java.lang.OutOfMemoryError: Java heap space
    -        at java.lang.StringCoding.decode(StringCoding.java:215)
    + at java.lang.StringCoding.decode(StringCoding.java:215)

    Tomcat JVM usage day
    @@ -492,15 +461,15 @@ java.util.Map does not have a no-arg default constructor.

  • Seems we added a bunch of settings to the /etc/default/tomcat7 in December, 2015 and never updated our ansible repository:

    JAVA_OPTS="-Djava.awt.headless=true -Xms3584m -Xmx3584m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8 -XX:-UseGCOverheadLimit -XX:MaxGCPauseMillis=250 -XX:GCTimeRatio=9 -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=8m -XX:InitiatingHeapOccupancyPercent=75 -XX:+UseLargePages -XX:+AggressiveOpts"

    2016-09-15

    @@ -514,8 +483,7 @@ java.util.Map does not have a no-arg default constructor.

    2016-09-16

  • CGSpace crashed again, and there are TONS of heap space errors but the datestamps aren’t on those lines so I’m not sure if they were yesterday:

    dn:CN=Orentlicher\, Natalie (CIAT),OU=Standard,OU=Users,OU=HQ,OU=CIATHUB,dc=cgiarad,dc=org
     Thu Sep 15 18:45:25 UTC 2016 | Query:id: 55785 AND type:2
    @@ -533,41 +501,38 @@ Exception in thread "http-bio-127.0.0.1-8081-exec-263" java.lang.OutOf
     Exception in thread "http-bio-127.0.0.1-8081-exec-280" java.lang.OutOfMemoryError: Java heap space
     Exception in thread "Thread-54216" org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Exception writing document id 7feaa95d-8e1f-4f45-80bb
     -e14ef82ee224 to the index; possible analysis error.
    -        at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
    -        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
    -        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
    -        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
    -        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
    -        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:102)
    -        at com.atmire.statistics.SolrLogThread.run(SourceFile:25)
    + at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
    + at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
    + at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
    + at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
    + at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
    + at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:102)
    + at com.atmire.statistics.SolrLogThread.run(SourceFile:25)
  • I bumped the heap space from 4096m to 5120m to see if this is really about heap space or not.

  • Looking into some of these errors that I’ve seen this week but haven’t noticed before:

    # zcat -f -- /var/log/tomcat7/catalina.* | grep -c 'Failed to generate the schema for the JAX-B elements'
     113

    2016-09-19

  • Work on cleanups for author affiliations after Peter sent me his list of corrections/deletions:

    $ ./fix-metadata-values.py -i affiliations_pb-322-corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p fuuu
     $ ./delete-metadata-values.py -f cg.contributor.affiliation -i affiliations_pb-2-deletions.csv -m 211 -u dspace -d dspace -p fuuu

    2016-09-20

    @@ -587,42 +552,42 @@ $ ./delete-metadata-values.py -f cg.contributor.affiliation -i affiliations_pb-2
  • We just need to set this in dspace/solr/search/conf/schema.xml:

    <solrQueryParser defaultOperator="AND"/>

    CGSpace search with "OR" boolean logic
    DSpace Test search with "AND" boolean logic

  • Found a way to improve the configuration of Atmire’s Content and Usage Analysis (CUA) module for date fields

    -content.analysis.dataset.option.8=metadata:dateAccessioned:discovery
     +content.analysis.dataset.option.8=metadata:dc.date.accessioned:date(month)
  • This allows the module to treat the field as a date rather than a text string, so we can interrogate it more intelligently

  • Add dc.date.accessioned to XMLUI Discovery search filters

  • Major CGSpace crash because ILRI forgot to pay the Linode bill

  • 45 minutes of downtime!

  • Start processing the fixes to dc.description.sponsorship from Peter Ballantyne:

    $ ./fix-metadata-values.py -i sponsors-fix-23.csv -f dc.description.sponsorship -t correct -m 29 -d dspace -u dspace -p fuuu
     $ ./delete-metadata-values.py -i sponsors-delete-8.csv -f dc.description.sponsorship -m 29 -d dspace -u dspace -p fuuu

    2016-09-22

    @@ -639,18 +604,19 @@ $ ./delete-metadata-values.py -i sponsors-delete-8.csv -f dc.description.sponsor
  • Merge updates to sponsorship controlled vocabulary (#277)
  • I’ve been trying to add a search filter for dc.description so the IITA people can search for some tags they use there, but for some reason the filter never shows up in Atmire’s CUA
  • Not sure if it’s something like we already have too many filters there (30), or the filter name is reserved, etc…
  • Generate a list of ILRI subjects for Peter and Abenet to look through/fix:

    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where resource_type_id=2 and metadata_field_id=203 group by text_value order by count desc) to /tmp/ilrisubjects.csv with csv;
  • Regenerate Discovery indexes a few times after playing with discovery.xml index definitions (syntax, parameters, etc).

  • Merge changes to boolean logic in Solr search (#274)

  • Run all sponsorship and affiliation fixes on CGSpace, deploy latest 5_x-prod branch, and re-index Discovery on CGSpace

  • Tested OCSP stapling on DSpace Test’s nginx and it works:

    $ openssl s_client -connect dspacetest.cgiar.org:443 -servername dspacetest.cgiar.org -tls1_2 -tlsextdebug -status
     ...
    @@ -658,48 +624,48 @@ OCSP response:
     ======================================
     OCSP Response Data:
     ...
    -    Cert Status: good
    +Cert Status: good

    2016-09-27

  • This author has a few variations:

    dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeu
     len, S%';
  • And it looks like fe4b719f-6cc4-4d65-8504-7a83130b9f83 is the authority with the correct ORCID linked

    dspacetest=# update metadatavalue set authority='fe4b719f-6cc4-4d65-8504-7a83130b9f83w', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen, S%';
     UPDATE 101
  • Hmm, now her name is missing from the authors facet and only shows the authority ID

  • On the production server there is an item with her ORCID but it is using a different authority: f01f7b7b-be3f-4df7-a61d-b73c067de88d

  • Maybe I used the wrong one… I need to look again at the production database

  • On a clean snapshot of the database I see the correct authority should be f01f7b7b-be3f-4df7-a61d-b73c067de88d, not fe4b719f-6cc4-4d65-8504-7a83130b9f83

  • Updating her authorities again and reindexing:

    dspacetest=# update metadatavalue set authority='f01f7b7b-be3f-4df7-a61d-b73c067de88d', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen, S%';
     UPDATE 101

    2016-09-28

    @@ -711,22 +677,23 @@ UPDATE 101
  • Going to try to update Sonja Vermeulen’s authority to 2b4166b7-6e4d-4f66-9d8b-ddfbec9a6ae0, as that seems to be one of her authorities that has an ORCID
  • Merge Font Awesome changes (#279)
  • Minor fix to a string in Atmire’s CUA module (#280)
  • This seems to be what I’ll need to do for Sonja Vermeulen (but with 2b4166b7-6e4d-4f66-9d8b-ddfbec9a6ae0 instead on the live site):

    dspacetest=# update metadatavalue set authority='09e4da69-33a3-45ca-b110-7d3f82d2d6d2', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen, S%';
     dspacetest=# update metadatavalue set authority='09e4da69-33a3-45ca-b110-7d3f82d2d6d2', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen SJ%';
  • And then update Discovery and Authority indexes
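  • That would be something like the following (a sketch; index-discovery probably wants -b for a full rebuild):

    $ /home/dspacetest.cgiar.org/bin/dspace index-authority
    $ /home/dspacetest.cgiar.org/bin/dspace index-discovery -b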

  • Minor fix for “Subject” string in Discovery search and Atmire modules (#281)

  • Start testing batch fixes for ILRI subject from Peter:

    $ ./fix-metadata-values.py -i ilrisubjects-fix-32.csv -f cg.subject.ilri -t correct -m 203 -d dspace -u dspace -p fuuuu
     $ ./delete-metadata-values.py -i ilrisubjects-delete-13.csv -f cg.subject.ilri -m 203 -d dspace -u dspace -p fuuu

    2016-09-29

    @@ -734,11 +701,12 @@ $ ./delete-metadata-values.py -i ilrisubjects-delete-13.csv -f cg.subject.ilri -
  • Add cg.identifier.ciatproject to metadata registry in preparation for CIAT project tag
  • Merge changes for CIAT project tag (#282)
  • DSpace Test (linode02) became unresponsive for some reason, I had to hard reboot it from the Linode console
  • People on DSpace mailing list gave me a query to get authors from certain collections:

    dspacetest=# select distinct text_value from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/5472', '10568/5473')));

    2016-09-30

    diff --git a/docs/2016-10/index.html b/docs/2016-10/index.html index 25950c863..8a1b4e4b0 100644 --- a/docs/2016-10/index.html +++ b/docs/2016-10/index.html @@ -16,10 +16,11 @@ Need to test the following scenarios to see how author order is affected: ORCIDs only ORCIDs plus normal authors + I exported a random item’s metadata as CSV, deleted all columns except id and collection, and made a new coloum called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry: - 0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X + " /> @@ -38,12 +39,13 @@ Need to test the following scenarios to see how author order is affected: ORCIDs only ORCIDs plus normal authors + I exported a random item’s metadata as CSV, deleted all columns except id and collection, and made a new coloum called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry: - 0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X + "/> - + @@ -132,11 +134,12 @@ I exported a random item’s metadata as CSV, deleted all columns except id
  • ORCIDs only
  • ORCIDs plus normal authors
  • I exported a random item’s metadata as CSV, deleted all columns except id and collection, and made a new column called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:

    0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
  • That left us with 3,180 valid corrections and 3 deletions:

    $ ./fix-metadata-values.py -i authors-fix-3180.csv -f dc.contributor.author -t correct -m 3 -d dspacetest -u dspacetest -p fuuu
     $ ./delete-metadata-values.py -i authors-delete-3.csv -f dc.contributor.author -m 3 -d dspacetest -u dspacetest -p fuuu
  • Remove old about page (#284)

  • CGSpace crashed a few times today

  • Generate list of unique authors in CCAFS collections:

    dspacetest=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/32729', '10568/5472', '10568/5473', '10568/10288', '10568/70974', '10568/3547', '10568/3549', '10568/3531','10568/16890','10568/5470','10568/3546', '10568/36024', '10568/66581', '10568/21789', '10568/5469', '10568/5468', '10568/3548', '10568/71053', '10568/25167'))) group by text_value order by count desc) to /tmp/ccafs-authors.csv with csv;

    2016-10-05

    @@ -203,24 +207,22 @@ $ ./delete-metadata-values.py -i authors-delete-3.csv -f dc.contributor.author - + +
  • Run fixes for ILRI subjects and delete blank metadata values:

    dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
     DELETE 11
  • Run all system updates and reboot CGSpace

  • Delete ten gigs of old 2015 Tomcat logs that never got rotated (WTF?):

    root@linode01:~# ls -lh /var/log/tomcat7/localhost_access_log.2015* | wc -l
     47

    2016-10-14

    @@ -234,34 +236,34 @@ DELETE 11

    2016-10-17

  • A bit more cleanup on the CCAFS authors, and run the corrections on DSpace Test:

    $ ./fix-metadata-values.py -i ccafs-authors-oct-16.csv -f dc.contributor.author -t 'correct name' -m 3 -d dspace -u dspace -p fuuu

    2016-10-18

  • Start working on DSpace 5.5 porting work again:

    $ git checkout -b 5_x-55 5_x-prod
     $ git rebase -i dspace-5.5

    2016-10-19

    @@ -286,38 +288,31 @@ $ git rebase -i dspace-5.5

    2016-10-25

  • Move the LIVES community from the top level to the ILRI projects community

    $ /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/27629 --child=10568/25101
  • Start testing some things for DSpace 5.5, like command line metadata import, PDF media filter, and Atmire CUA
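  • For the command line metadata import I’ll probably test with something like this (a sketch with a made-up CSV path):

    $ /home/dspacetest.cgiar.org/bin/dspace metadata-import -f /tmp/test-metadata.csv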

  • Start looking at batch fixing of “old” ILRI website links without www or https, for example:

    dspace=# select * from metadatavalue where resource_type_id=2 and text_value like 'http://ilri.org%';
  • Also CCAFS has HTTPS and their links should use it where possible:

    dspace=# select * from metadatavalue where resource_type_id=2 and text_value like 'http://ccafs.cgiar.org%';
  • And this will find community and collection HTML text that is using the old style PNG/JPG icons for RSS and email (we should be using Font Awesome icons instead):

    dspace=# select text_value from metadatavalue where resource_type_id in (3,4) and text_value like '%Iconrss2.png%';
  • Turns out there are shit tons of varieties of this, like with http, https, www, separate </img> tags, alignments, etc

  • Had to find all variations and replace them individually:

    dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://www.ilri.org/images/Iconrss2.png"/>','<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://www.ilri.org/images/Iconrss2.png"/>%';
     dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://www.ilri.org/images/email.jpg"/>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://www.ilri.org/images/email.jpg"/>%';
    @@ -335,19 +330,19 @@ dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<i
     dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img valign="center" align="left" src="https://www.ilri.org/images/email.jpg"/>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img valign="center" align="left" src="https://www.ilri.org/images/email.jpg"/>%';
     dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img valign="center" align="left" src="http://www.ilri.org/images/Iconrss2.png"/>', '<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img valign="center" align="left" src="http://www.ilri.org/images/Iconrss2.png"/>%';
     dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img valign="center" align="left" src="http://www.ilri.org/images/email.jpg"/>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img valign="center" align="left" src="http://www.ilri.org/images/email.jpg"/>%';

    2016-10-27

  • Run Font Awesome fixes on DSpace Test:

    dspace=# \i /tmp/font-awesome-text-replace.sql
     UPDATE 17
    @@ -367,10 +362,9 @@ UPDATE 1
     UPDATE 1
     UPDATE 1
     UPDATE 0

    CGSpace with old icons
    @@ -383,53 +377,47 @@ UPDATE 0

    2016-10-30

  • Fix some messed up authors on CGSpace:

    dspace=# update metadatavalue set authority='799da1d8-22f3-43f5-8233-3d2ef5ebf8a8', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Charleston, B.%';
     UPDATE 10
     dspace=# update metadatavalue set authority='e936f5c5-343d-4c46-aa91-7a1fff6277ed', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Knight-Jones%';
     UPDATE 36
  • I updated the authority index but nothing seemed to change, so I’ll wait and do it again after I update Discovery below

  • Skype chat with Tsega about the IFPRI contentdm bridge

  • We tested harvesting OAI in an example collection to see how it works

  • Talk to Carlos Quiros about CG Core metadata in CGSpace

  • Get a list of countries from CGSpace so I can do some batch corrections:

    dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id=228 group by text_value order by count desc) to /tmp/countries.csv with csv;
  • Fix a bunch of countries in Open Refine and run the corrections on CGSpace:

    $ ./fix-metadata-values.py -i countries-fix-18.csv -f dc.coverage.country -t 'correct' -m 228 -d dspace -u dspace -p fuuu
     $ ./delete-metadata-values.py -i countries-delete-2.csv -f dc.coverage.country -m 228 -d dspace -u dspace -p fuuu
  • Run a shit ton of author fixes from Peter Ballantyne that we’ve been cleaning up for two months:

    $ ./fix-metadata-values.py -i /tmp/authors-fix-pb2.csv -f dc.contributor.author -t correct -m 3 -u dspace -d dspace -p fuuu
  • Run a few URL corrections for ilri.org and doi.org, etc:

    dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://www.ilri.org','https://www.ilri.org') where resource_type_id=2 and text_value like '%http://www.ilri.org%';
     dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://mahider.ilri.org', 'https://cgspace.cgiar.org') where resource_type_id=2 and text_value like '%http://mahider.%.org%' and metadata_field_id not in (28);
     dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://dx.doi.org', 'https://dx.doi.org') where resource_type_id=2 and text_value like '%http://dx.doi.org%' and metadata_field_id not in (18,26,28,111);
     dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://doi.org', 'https://dx.doi.org') where resource_type_id=2 and text_value like '%http://doi.org%' and metadata_field_id not in (18,26,28,111);
    diff --git a/docs/2016-11/index.html b/docs/2016-11/index.html index 8f342f85f..c2c57e3e2 100644 --- a/docs/2016-11/index.html +++ b/docs/2016-11/index.html @@ -27,7 +27,7 @@ Add dc.type to the output options for Atmire’s Listings and Reports module @@ -121,8 +121,8 @@ Add dc.type to the output options for Atmire’s Listings and Reports module
  • Run all updates on DSpace Test and reboot the server
  • Looks like the OAI bug from DSpace 5.1 that caused validation at Base Search to fail is now fixed and DSpace Test passes validation! (#63)
  • Indexing Discovery on DSpace Test took 332 minutes, which is like five times as long as it usually takes
  • At the end it appeared to finish correctly but there were lots of errors right after it finished:

    2016-11-02 15:09:48,578 INFO  com.atmire.dspace.discovery.AtmireSolrService @ Wrote Collection: 10568/76454 to Index
     2016-11-02 15:09:48,584 INFO  com.atmire.dspace.discovery.AtmireSolrService @ Wrote Community: 10568/3202 to Index
    @@ -134,27 +134,25 @@ Add dc.type to the output options for Atmire’s Listings and Reports module
     2016-11-02 15:09:48,616 INFO  com.atmire.dspace.discovery.AtmireSolrService @ Wrote Collection: 10568/76457 to Index
     2016-11-02 15:09:48,634 ERROR com.atmire.dspace.discovery.AtmireSolrService @
     java.lang.NullPointerException
    -        at org.dspace.discovery.SearchUtils.getDiscoveryConfiguration(SourceFile:57)
    -        at org.dspace.discovery.SolrServiceImpl.buildDocument(SolrServiceImpl.java:824)
    -        at com.atmire.dspace.discovery.AtmireSolrService.indexContent(AtmireSolrService.java:821)
    -        at com.atmire.dspace.discovery.AtmireSolrService.updateIndex(AtmireSolrService.java:898)
    -        at org.dspace.discovery.SolrServiceImpl.createIndex(SolrServiceImpl.java:370)
    -        at org.dspace.storage.rdbms.DatabaseUtils$ReindexerThread.run(DatabaseUtils.java:945)
    + at org.dspace.discovery.SearchUtils.getDiscoveryConfiguration(SourceFile:57)
    + at org.dspace.discovery.SolrServiceImpl.buildDocument(SolrServiceImpl.java:824)
    + at com.atmire.dspace.discovery.AtmireSolrService.indexContent(AtmireSolrService.java:821)
    + at com.atmire.dspace.discovery.AtmireSolrService.updateIndex(AtmireSolrService.java:898)
    + at org.dspace.discovery.SolrServiceImpl.createIndex(SolrServiceImpl.java:370)
    + at org.dspace.storage.rdbms.DatabaseUtils$ReindexerThread.run(DatabaseUtils.java:945)
  • DSpace is still up, and a few minutes later I see the default DSpace indexer is still running

  • Sure enough, looking back before the first one finished, I see output from both indexers interleaved in the log:

    2016-11-02 15:09:28,545 INFO  org.dspace.discovery.SolrServiceImpl @ Wrote Item: 10568/47242 to Index
     2016-11-02 15:09:28,633 INFO  org.dspace.discovery.SolrServiceImpl @ Wrote Item: 10568/60785 to Index
     2016-11-02 15:09:28,678 INFO  com.atmire.dspace.discovery.AtmireSolrService @ Processing (55695 of 55722): 43557
     2016-11-02 15:09:28,688 INFO  com.atmire.dspace.discovery.AtmireSolrService @ Processing (55703 of 55722): 34476

    2016-11-06

    @@ -166,30 +164,25 @@ java.lang.NullPointerException

    2016-11-07

  • Horrible one liner to get Linode ID from certain Ansible host vars:

    $ grep -A 3 contact_info * | grep -E "(Orth|Sisay|Peter|Daniel|Tsega)" | awk -F'-' '{print $1}' | grep linode | uniq | xargs grep linode_id
  • I noticed some weird CRPs in the database, and they don’t show up in Discovery for some reason, perhaps the :

  • I’ll export these and fix them in batch:

    dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id=230 group by text_value order by count desc) to /tmp/crp.csv with csv;
     COPY 22
  • Test running the replacements:

    $ ./fix-metadata-values.py -i /tmp/CRPs.csv -f cg.contributor.crp -t correct -m 230 -d dspace -u dspace -p 'fuuu'

    2016-11-08

    @@ -207,11 +200,12 @@ COPY 22
  • All records in the authority core are either authority_type:orcid or authority_type:person
  • There is a deleted field and all items seem to be false, but might be important sanity check to remember
  • The way to go is probably to have a CSV of author names and authority IDs, then to batch update them in PostgreSQL
  • Dump of the top ~200 authors in CGSpace:

    dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id=3 group by text_value order by count desc limit 210) to /tmp/210-authors.csv with csv;
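  • Once that list is cleaned up, the batch update I have in mind would look roughly like this (a sketch with a made-up pipe-delimited input file, since author names themselves contain commas, and names containing quotes would need escaping):

    $ while IFS='|' read -r author authority; do psql dspace -c "update metadatavalue set authority='$authority', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value='$author';"; done < /tmp/authors-authorities.txt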

    2016-11-09

    @@ -225,128 +219,114 @@ COPY 22
  • Playing with find-by-metadata-field, this works:

    $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS"}'
  • But the results are deceiving because metadata fields can have text languages and your query must match exactly!

    dspace=# select distinct text_value, text_lang from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS';
    - text_value | text_lang
    +text_value | text_lang
     ------------+-----------
    - SEEDS      |
    - SEEDS      |
    - SEEDS      | en_US
    +SEEDS      |
    +SEEDS      |
    +SEEDS      | en_US
     (3 rows)
  • So basically, the text language here could be null, blank, or en_US

  • To query metadata with these properties, you can do:

    $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS"}' | jq length
     55
     $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS", "language":""}' | jq length
     34
     $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS", "language":"en_US"}' | jq length
  • The results (55+34=89) don’t seem to match those from the database:

    dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS' and text_lang is null;
    - count
    +count
     -------
    -    15
    +15
     dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS' and text_lang='';
    - count
    +count
     -------
    -     4
    + 4
     dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS' and text_lang='en_US';
    - count
    +count
     -------
    -    66
    +66
  • So, querying from the API I get 55 + 34 = 89 results, but the database actually only has 85…

  • And the find-by-metadata-field endpoint doesn’t seem to have a way to get all items with the field, or a wildcard value

  • I’ll ask a question on the dspace-tech mailing list

  • And speaking of text_lang, this is interesting:

    dspacetest=# select distinct text_lang from metadatavalue where resource_type_id=2;
    - text_lang
    +text_lang
     -----------
     
    - ethnob
    - en
    - spa
    - EN
    - es
    - frn
    - en_
    - en_US
    +ethnob
    +en
    +spa
    +EN
    +es
    +frn
    +en_
    +en_US
     
    - EN_US
    - eng
    - en_U
    - fr
    +EN_US
    +eng
    +en_U
    +fr
     (14 rows)
  • Generate a list of all these so I can maybe fix them in batch:

    dspace=# \copy (select distinct text_lang, count(*) from metadatavalue where resource_type_id=2 group by text_lang order by count desc) to /tmp/text-langs.csv with csv;
     COPY 14
  • Perhaps we need to fix them all in batch, or experiment with fixing only certain metadatavalues:

    dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS';
     UPDATE 85
  • The fix-metadata.py script I have is meant for specific metadata values, so if I want to update some text_lang values I should just do it directly in the database

  • For example, on a limited set:

    dspace=# update metadatavalue set text_lang=NULL where resource_type_id=2 and metadata_field_id=203 and text_value='LIVESTOCK' and text_lang='';
     UPDATE 420
  • And assuming I want to do it for all fields:

    dspacetest=# update metadatavalue set text_lang=NULL where resource_type_id=2 and text_lang='';
     UPDATE 183726
  • After that restarted Tomcat and PostgreSQL (because I’m superstitious about caches) and now I see the following in REST API query:

    $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS"}' | jq length
     71
     $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS", "language":""}' | jq length
     0
     $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS", "language":"en_US"}' | jq length

    2016-11-14

    @@ -355,8 +335,8 @@ $ curl -s -H "accept: application/json" -H "Content-Type: applica
  • I applied Atmire’s suggestions to fix Listings and Reports for DSpace 5.5 and now it works
  • There were some issues with the dspace/modules/jspui/pom.xml, which is annoying because all I did was rebase our working 5.1 code on top of 5.5, meaning Atmire’s installation procedure must have changed
  • So there is apparently this Tomcat native way to limit web crawlers to one session: Crawler Session Manager
  • After adding that to server.xml, bots matching the pattern in the configuration will all use ONE session, just like normal users:

    $ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
     HTTP/1.1 200 OK
    @@ -383,11 +363,11 @@ Server: nginx
     Transfer-Encoding: chunked
     Vary: Accept-Encoding
     X-Cocoon-Version: 2.2.0

    2016-11-15

    @@ -400,8 +380,7 @@ X-Cocoon-Version: 2.2.0
    Tomcat JVM heap (week) after setting up the Crawler Session Manager

  • Seems the default regex doesn’t catch Baidu, though:

    $ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
     HTTP/1.1 200 OK
    @@ -428,20 +407,16 @@ Set-Cookie: JSESSIONID=F6403C084480F765ED787E41D2521903; Path=/; Secure; HttpOnl
     Transfer-Encoding: chunked
     Vary: Accept-Encoding
     X-Cocoon-Version: 2.2.0
  • Adding Baiduspider to the list of user agents seems to work, and the final configuration should be:

    <!-- Crawler Session Manager Valve helps mitigate damage done by web crawlers -->
     <Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
    -       crawlerUserAgents=".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*Baiduspider.*" />
    + crawlerUserAgents=".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*Baiduspider.*" />
  • Looking at the bots that were active yesterday it seems the above regex should be sufficient:

    $ grep -o -E 'Mozilla/5\.0 \(compatible;.*\"' /var/log/nginx/access.log | sort | uniq
     Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)" "-"
    @@ -449,60 +424,54 @@ Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" &q
     Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
     Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)" "-"
     Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)" "-"
  • Neat maven trick to exclude some modules from being built:

    $ mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false -Denv=localhost -P \!dspace-lni,\!dspace-rdf,\!dspace-sword,\!dspace-swordv2 clean package

    2016-11-17

  • Generate a list of journal titles for Peter and Abenet to look through so we can make a controlled vocabulary out of them:

    dspace=# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=55 group by text_value order by count desc) to /tmp/journal-titles.csv with csv;
     COPY 2515
  • Send a message to users of the CGSpace REST API to notify them of upcoming upgrade so they can test their apps against DSpace Test

  • Test an update of old, non-HTTPS links to the CCAFS website in CGSpace metadata:

    dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, 'http://ccafs.cgiar.org','https://ccafs.cgiar.org') where resource_type_id=2 and text_value like '%http://ccafs.cgiar.org%';
     UPDATE 164
     dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://ccafs.cgiar.org','https://ccafs.cgiar.org') where resource_type_id=2 and text_value like '%http://ccafs.cgiar.org%';
     UPDATE 7
  • Had to run it twice to get all (not sure about “global” regex in PostgreSQL)
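  • Apparently PostgreSQL’s regexp_replace only replaces the first match in each value unless you pass the 'g' flag, so something like this (a sketch, untested here) should catch values that contain the URL more than once in a single pass:

    $ psql dspace -c "update metadatavalue set text_value = regexp_replace(text_value, 'http://ccafs.cgiar.org', 'https://ccafs.cgiar.org', 'g') where resource_type_id=2 and text_value like '%http://ccafs.cgiar.org%';"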

  • Run the updates on CGSpace as well

  • Run through some collections and manually regenerate some PDF thumbnails for items from before 2016 on DSpace Test to compare with CGSpace

  • I’m debating forcing the re-generation of ALL thumbnails, since some come from DSpace 3 and 4 when the thumbnailing wasn’t as good

  • The results were very good, I think that after we upgrade to 5.5 I will do it, perhaps one community / collection at a time:

    $ [dspace]/bin/dspace filter-media -f -i 10568/67156 -p "ImageMagick PDF Thumbnail"
  • In related news, I’m looking at thumbnails of thumbnails (the ones we uploaded manually before, and now DSpace’s media filter has made thumbnails of THEM):

    dspace=# select text_value from metadatavalue where text_value like '%.jpg.jpg';

    2016-11-18

    @@ -566,8 +535,8 @@ UPDATE 7
  • Looking at the Catalina logs I see there is some super long-running indexing process going on:

    INFO: FrameworkServlet 'oai': initialization completed in 2600 ms
     [>                                                  ] 0% time remaining: Calculating... timestamp: 2016-11-28 03:00:18
    @@ -577,32 +546,33 @@ UPDATE 7
     [>                                                  ] 0% time remaining: 14 hour(s) 5 minute(s) 56 seconds. timestamp: 2016-11-28 03:00:19
     [>                                                  ] 0% time remaining: 11 hour(s) 23 minute(s) 49 seconds. timestamp: 2016-11-28 03:00:19
     [>                                                  ] 0% time remaining: 11 hour(s) 21 minute(s) 57 seconds. timestamp: 2016-11-28 03:00:20
  • It says OAI, and seems to start at 3:00 AM, but I only see the filter-media cron job set to start then

  • Double checking the DSpace 5.x upgrade notes for anything I missed, or troubleshooting tips

  • Running some manual processes just in case:

    $ /home/dspacetest.cgiar.org/bin/dspace registry-loader -metadata /home/dspacetest.cgiar.org/config/registries/dcterms-types.xml
     $ /home/dspacetest.cgiar.org/bin/dspace registry-loader -metadata /home/dspacetest.cgiar.org/config/registries/dublin-core-types.xml
     $ /home/dspacetest.cgiar.org/bin/dspace registry-loader -metadata /home/dspacetest.cgiar.org/config/registries/eperson-types.xml
     $ /home/dspacetest.cgiar.org/bin/dspace registry-loader -metadata /home/dspacetest.cgiar.org/config/registries/workflow-types.xml

    2016-11-29

  • Around the time of his login I see this in the DSpace logs:

    2016-11-29 07:56:36,350 INFO  org.dspace.authenticate.LDAPAuthentication @ g.cherinet@cgiar.org:session_id=F628E13AB4EF2BA949198A99EFD8EBE4:ip_addr=213.55.99.121:failed_login:no DN found for user g.cherinet@cgiar.org
     2016-11-29 07:56:36,350 INFO  org.dspace.authenticate.PasswordAuthentication @ g.cherinet@cgiar.org:session_id=F628E13AB4EF2BA949198A99EFD8EBE4:ip_addr=213.55.99.121:authenticate:attempting password auth of user=g.cherinet@cgiar.org
    @@ -615,30 +585,32 @@ $ /home/dspacetest.cgiar.org/bin/dspace registry-loader -metadata /home/dspacete
     2016-11-29 07:56:36,701 INFO  org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ facets for scope, null: 23
     2016-11-29 07:56:36,747 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
     org.dspace.discovery.SearchServiceException: Error executing query
    -        at org.dspace.discovery.SolrServiceImpl.search(SolrServiceImpl.java:1618)
    -        at org.dspace.discovery.SolrServiceImpl.search(SolrServiceImpl.java:1600)
    -        at org.dspace.discovery.SolrServiceImpl.search(SolrServiceImpl.java:1583)
    -        at org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer.performSearch(SidebarFacetsTransformer.java:165)
    -        at org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer.addOptions(SidebarFacetsTransformer.java:174)
    -        at org.dspace.app.xmlui.wing.AbstractWingTransformer.startElement(AbstractWingTransformer.java:228)
    -        at sun.reflect.GeneratedMethodAccessor277.invoke(Unknown Source)
    +    at org.dspace.discovery.SolrServiceImpl.search(SolrServiceImpl.java:1618)
    +    at org.dspace.discovery.SolrServiceImpl.search(SolrServiceImpl.java:1600)
    +    at org.dspace.discovery.SolrServiceImpl.search(SolrServiceImpl.java:1583)
    +    at org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer.performSearch(SidebarFacetsTransformer.java:165)
    +    at org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer.addOptions(SidebarFacetsTransformer.java:174)
    +    at org.dspace.app.xmlui.wing.AbstractWingTransformer.startElement(AbstractWingTransformer.java:228)
    +    at sun.reflect.GeneratedMethodAccessor277.invoke(Unknown Source)
     ...
  • At about the same time in the solr log I see a super long query:

    2016-11-29 07:56:36,734 INFO  org.apache.solr.core.SolrCore @ [search] webapp=/solr path=/select params={q=*:*&fl=dateIssued.year,handle,search.resourcetype,search.resourceid,search.uniqueid&start=0&fq=NOT(withdrawn:true)&fq=NOT(discoverable:false)&fq=dateIssued.year:[*+TO+*]&fq=read:(g0+OR+e574+OR+g0+OR+g3+OR+g9+OR+g10+OR+g14+OR+g16+OR+g18+OR+g20+OR+g23+OR+g24+OR+g2072+OR+g2074+OR+g28+OR+g2076+OR+g29+OR+g2078+OR+g2080+OR+g34+OR+g2082+OR+g2084+OR+g38+OR+g2086+OR+g2088+OR+g2091+OR+g43+OR+g2092+OR+g2093+OR+g2095+OR+g2097+OR+g50+OR+g2099+OR+g51+OR+g2103+OR+g62+OR+g65+OR+g2115+OR+g2117+OR+g2119+OR+g2121+OR+g2123+OR+g2125+OR+g77+OR+g78+OR+g79+OR+g2127+OR+g80+OR+g2129+OR+g2131+OR+g2133+OR+g2134+OR+g2135+OR+g2136+OR+g2137+OR+g2138+OR+g2139+OR+g2140+OR+g2141+OR+g2142+OR+g2148+OR+g2149+OR+g2150+OR+g2151+OR+g2152+OR+g2153+OR+g2154+OR+g2156+OR+g2165+OR+g2167+OR+g2171+OR+g2174+OR+g2175+OR+g129+OR+g2182+OR+g2186+OR+g2189+OR+g153+OR+g158+OR+g166+OR+g167+OR+g168+OR+g169+OR+g2225+OR+g179+OR+g2227+OR+g2229+OR+g183+OR+g2231+OR+g184+OR+g2233+OR+g186+OR+g2235+OR+g2237+OR+g191+OR+g192+OR+g193+OR+g202+OR+g203+OR+g204+OR+g205+OR+g207+OR+g208+OR+g218+OR+g219+OR+g222+OR+g223+OR+g230+OR+g231+OR+g238+OR+g241+OR+g244+OR+g254+OR+g255+OR+g262+OR+g265+OR+g268+OR+g269+OR+g273+OR+g276+OR+g277+OR+g279+OR+g282+OR+g2332+OR+g2335+OR+g2338+OR+g292+OR+g293+OR+g2341+OR+g296+OR+g2344+OR+g297+OR+g2347+OR+g301+OR+g2350+OR+g303+OR+g305+OR+g2356+OR+g310+OR+g311+OR+g2359+OR+g313+OR+g2362+OR+g2365+OR+g2368+OR+g321+OR+g2371+OR+g325+OR+g2374+OR+g328+OR+g2377+OR+g2380+OR+g333+OR+g2383+OR+g2386+OR+g2389+OR+g342+OR+g343+OR+g2392+OR+g345+OR+g2395+OR+g348+OR+g2398+OR+g2401+OR+g2404+OR+g2407+OR+g364+OR+g366+OR+g2425+OR+g2427+OR+g385+OR+g387+OR+g388+OR+g389+OR+g2442+OR+g395+OR+g2443+OR+g2444+OR+g401+OR+g403+OR+g405+OR+g408+OR+g2457+OR+g2458+OR+g411+OR+g2459+OR+g414+OR+g2463+OR+g417+OR+g2465+OR+g2467+OR+g421+OR+g2469+OR+g2471+OR+g424+OR+g2473+OR+g2475+OR+g2476+OR+g429+OR+g433+OR+g2481+OR+g2482+OR+g2483+OR+g443+OR+g444+OR+g445+OR+g446+OR+g448+OR+g453+OR+g455+OR+g456+OR+g457+OR+g458+OR+g459+OR+g461+OR+g462+OR+g463+OR+g464+OR+g465+OR+g467+OR+g468+OR+g469+OR+g474+OR+g476+OR+g477+OR+g480+OR+g483+OR+g484+OR+g493+OR+g496+OR+g497+OR+g498+OR+g500+OR+g502+OR+g504+OR+g505+OR+g2559+OR+g2560+OR+g513+OR+g2561+OR+g515+OR+g516+OR+g518+OR+g519+OR+g2567+OR+g520+OR+g521+OR+g522+OR+g2570+OR+g523+OR+g2571+OR+g524+OR+g525+OR+g2573+OR+g526+OR+g2574+OR+g527+OR+g528+OR+g2576+OR+g529+OR+g531+OR+g2579+OR+g533+OR+g534+OR+g2582+OR+g535+OR+g2584+OR+g538+OR+g2586+OR+g540+OR+g2588+OR+g541+OR+g543+OR+g544+OR+g545+OR+g546+OR+g548+OR+g2596+OR+g549+OR+g551+OR+g555+OR+g556+OR+g558+OR+g561+OR+g569+OR+g570+OR+g571+OR+g2619+OR+g572+OR+g2620+OR+g573+OR+g2621+OR+g2622+OR+g575+OR+g578+OR+g581+OR+g582+OR+g584+OR+g585+OR+g586+OR+g587+OR+g588+OR+g590+OR+g591+OR+g593+OR+g595+OR+g596+OR+g598+OR+g599+OR+g601+OR+g602+OR+g603+OR+g604+OR+g605+OR+g606+OR+g608+OR+g609+OR+g610+OR+g612+OR+g614+OR+g616+OR+g620+OR+g621+OR+g623+OR+g630+OR+g635+OR+g636+OR+g646+OR+g649+OR+g683+OR+g684+OR+g687+OR+g689+OR+g691+OR+g695+OR+g697+OR+g698+OR+g699+OR+g700+OR+g701+OR+g707+OR+g708+OR+g709+OR+g710+OR+g711+OR+g712+OR+g713+OR+g714+OR+g715+OR+g716+OR+g717+OR+g719+OR+g720+OR+g729+OR+g732+OR+g733+OR+g734+OR+g736+OR+g737+OR+g738+OR+g2786+OR+g752+OR+g754+OR+g2804+OR+g757+OR+g2805+OR+g2806+OR+g760+OR+g761+OR+g2810+OR+g2815+OR+g769+OR+g771+OR+g773+OR+g776+OR+g786+OR+g787+OR+g788+OR+g789+OR+g791+OR+g792+OR+g793+OR+g794+OR+g795+OR+g796+OR+g798+OR+g800+OR+g802+OR+g803+OR+g806+OR+g808+OR+g810+OR+g814+OR+g815+OR+g817+OR+g829+O
R+g830+OR+g849+OR+g893+OR+g895+OR+g898+OR+g902+OR+g903+OR+g917+OR+g919+OR+g921+OR+g922+OR+g923+OR+g924+OR+g925+OR+g926+OR+g927+OR+g928+OR+g929+OR+g930+OR+g932+OR+g933+OR+g934+OR+g938+OR+g939+OR+g944+OR+g945+OR+g946+OR+g947+OR+g948+OR+g949+OR+g950+OR+g951+OR+g953+OR+g954+OR+g955+OR+g956+OR+g958+OR+g959+OR+g960+OR+g963+OR+g964+OR+g965+OR+g968+OR+g969+OR+g970+OR+g971+OR+g972+OR+g973+OR+g974+OR+g976+OR+g978+OR+g979+OR+g984+OR+g985+OR+g987+OR+g988+OR+g991+OR+g993+OR+g994+OR+g999+OR+g1000+OR+g1003+OR+g1005+OR+g1006+OR+g1007+OR+g1012+OR+g1013+OR+g1015+OR+g1016+OR+g1018+OR+g1023+OR+g1024+OR+g1026+OR+g1028+OR+g1030+OR+g1032+OR+g1033+OR+g1035+OR+g1036+OR+g1038+OR+g1039+OR+g1041+OR+g1042+OR+g1044+OR+g1045+OR+g1047+OR+g1048+OR+g1050+OR+g1051+OR+g1053+OR+g1054+OR+g1056+OR+g1057+OR+g1058+OR+g1059+OR+g1060+OR+g1061+OR+g1062+OR+g1063+OR+g1064+OR+g1065+OR+g1066+OR+g1068+OR+g1071+OR+g1072+OR+g1074+OR+g1075+OR+g1076+OR+g1077+OR+g1078+OR+g1080+OR+g1081+OR+g1082+OR+g1084+OR+g1085+OR+g1087+OR+g1088+OR+g1089+OR+g1090+OR+g1091+OR+g1092+OR+g1093+OR+g1094+OR+g1095+OR+g1096+OR+g1097+OR+g1106+OR+g1108+OR+g1110+OR+g1112+OR+g1114+OR+g1117+OR+g1120+OR+g1121+OR+g1126+OR+g1128+OR+g1129+OR+g1131+OR+g1136+OR+g1138+OR+g1140+OR+g1141+OR+g1143+OR+g1145+OR+g1146+OR+g1148+OR+g1152+OR+g1154+OR+g1156+OR+g1158+OR+g1159+OR+g1160+OR+g1162+OR+g1163+OR+g1165+OR+g1166+OR+g1168+OR+g1170+OR+g1172+OR+g1175+OR+g1177+OR+g1179+OR+g1181+OR+g1185+OR+g1191+OR+g1193+OR+g1197+OR+g1199+OR+g1201+OR+g1203+OR+g1204+OR+g1215+OR+g1217+OR+g1219+OR+g1221+OR+g1224+OR+g1226+OR+g1227+OR+g1228+OR+g1230+OR+g1231+OR+g1232+OR+g1233+OR+g1234+OR+g1235+OR+g1236+OR+g1237+OR+g1238+OR+g1240+OR+g1241+OR+g1242+OR+g1243+OR+g1244+OR+g1246+OR+g1248+OR+g1250+OR+g1252+OR+g1254+OR+g1256+OR+g1257+OR+g1259+OR+g1261+OR+g1263+OR+g1275+OR+g1276+OR+g1277+OR+g1278+OR+g1279+OR+g1282+OR+g1284+OR+g1288+OR+g1290+OR+g1293+OR+g1296+OR+g1297+OR+g1299+OR+g1303+OR+g1304+OR+g1306+OR+g1309+OR+g1310+OR+g1311+OR+g1312+OR+g1313+OR+g1316+OR+g1318+OR+g1320+OR+g1322+OR+g1323+OR+g1324+OR+g1325+OR+g1326+OR+g1329+OR+g1331+OR+g1347+OR+g1348+OR+g1361+OR+g1362+OR+g1363+OR+g1364+OR+g1367+OR+g1368+OR+g1369+OR+g1370+OR+g1371+OR+g1374+OR+g1376+OR+g1377+OR+g1378+OR+g1380+OR+g1381+OR+g1386+OR+g1389+OR+g1391+OR+g1392+OR+g1393+OR+g1395+OR+g1396+OR+g1397+OR+g1400+OR+g1402+OR+g1406+OR+g1408+OR+g1415+OR+g1417+OR+g1433+OR+g1435+OR+g1441+OR+g1442+OR+g1443+OR+g1444+OR+g1446+OR+g1448+OR+g1450+OR+g1451+OR+g1452+OR+g1453+OR+g1454+OR+g1456+OR+g1458+OR+g1460+OR+g1462+OR+g1464+OR+g1466+OR+g1468+OR+g1470+OR+g1471+OR+g1475+OR+g1476+OR+g1477+OR+g1478+OR+g1479+OR+g1481+OR+g1482+OR+g1483+OR+g1484+OR+g1485+OR+g1486+OR+g1487+OR+g1488+OR+g1489+OR+g1490+OR+g1491+OR+g1492+OR+g1493+OR+g1495+OR+g1497+OR+g1499+OR+g1501+OR+g1503+OR+g1504+OR+g1506+OR+g1508+OR+g1511+OR+g1512+OR+g1513+OR+g1516+OR+g1522+OR+g1535+OR+g1536+OR+g1537+OR+g1539+OR+g1540+OR+g1541+OR+g1542+OR+g1547+OR+g1549+OR+g1551+OR+g1553+OR+g1555+OR+g1557+OR+g1559+OR+g1561+OR+g1563+OR+g1565+OR+g1567+OR+g1569+OR+g1571+OR+g1573+OR+g1580+OR+g1583+OR+g1588+OR+g1590+OR+g1592+OR+g1594+OR+g1595+OR+g1596+OR+g1598+OR+g1599+OR+g1600+OR+g1601+OR+g1602+OR+g1604+OR+g1606+OR+g1610+OR+g1611+OR+g1612+OR+g1613+OR+g1616+OR+g1619+OR+g1622+OR+g1624+OR+g1625+OR+g1626+OR+g1628+OR+g1629+OR+g1631+OR+g1632+OR+g1692+OR+g1694+OR+g1695+OR+g1697+OR+g1705+OR+g1706+OR+g1707+OR+g1708+OR+g1711+OR+g1715+OR+g1717+OR+g1719+OR+g1721+OR+g1722+OR+g1723+OR+g1724+OR+g1725+OR+g1726+OR+g1727+OR+g1731+OR+g1732+OR+g1736+OR+g1737+OR+g1738+OR+g1740+OR+g1742+OR+g1743+OR+g1753+OR+g1755+OR+g1758+OR+g1759+OR+g1764+OR+g1766+OR+g176
9+OR+g1774+OR+g1782+OR+g1794+OR+g1796+OR+g1797+OR+g1814+OR+g1818+OR+g1826+OR+g1853+OR+g1855+OR+g1857+OR+g1858+OR+g1859+OR+g1860+OR+g1861+OR+g1863+OR+g1864+OR+g1865+OR+g1867+OR+g1869+OR+g1871+OR+g1873+OR+g1875+OR+g1877+OR+g1879+OR+g1881+OR+g1883+OR+g1884+OR+g1885+OR+g1887+OR+g1889+OR+g1891+OR+g1892+OR+g1894+OR+g1896+OR+g1898+OR+g1900+OR+g1902+OR+g1907+OR+g1910+OR+g1915+OR+g1916+OR+g1917+OR+g1918+OR+g1929+OR+g1931+OR+g1932+OR+g1933+OR+g1934+OR+g1936+OR+g1937+OR+g1938+OR+g1939+OR+g1940+OR+g1942+OR+g1944+OR+g1945+OR+g1948+OR+g1950+OR+g1955+OR+g1961+OR+g1962+OR+g1964+OR+g1966+OR+g1968+OR+g1970+OR+g1972+OR+g1974+OR+g1976+OR+g1979+OR+g1982+OR+g1984+OR+g1985+OR+g1986+OR+g1987+OR+g1989+OR+g1991+OR+g1996+OR+g2003+OR+g2007+OR+g2011+OR+g2019+OR+g2020+OR+g2046)&sort=dateIssued.year_sort+desc&rows=1&wt=javabin&version=2} hits=56080 status=0 QTime=3

    2016-11-30

    diff --git a/docs/2016-12/index.html b/docs/2016-12/index.html index 193e9d033..c31c76c92 100644 --- a/docs/2016-12/index.html +++ b/docs/2016-12/index.html @@ -10,8 +10,8 @@ CGSpace was down for five hours in the morning while I was sleeping -While looking in the logs for errors, I see tons of warnings about Atmire MQM: +While looking in the logs for errors, I see tons of warnings about Atmire MQM: 2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607") 2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail="dc.title", transactionID="TX157907838689377964651674089851855413607") @@ -20,9 +20,10 @@ While looking in the logs for errors, I see tons of warnings about Atmire MQM: 2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607") - I see thousands of them in the logs for the last few months, so it’s not related to the DSpace 5.5 upgrade + I’ve raised a ticket with Atmire to ask + Another worrying error from dspace.log is: " /> @@ -36,8 +37,8 @@ Another worrying error from dspace.log is: CGSpace was down for five hours in the morning while I was sleeping -While looking in the logs for errors, I see tons of warnings about Atmire MQM: +While looking in the logs for errors, I see tons of warnings about Atmire MQM: 2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607") 2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail="dc.title", transactionID="TX157907838689377964651674089851855413607") @@ -46,12 +47,13 @@ While looking in the logs for errors, I see tons of warnings about Atmire MQM: 2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], 
transactionID="TX157907838689377964651674089851855413607") - I see thousands of them in the logs for the last few months, so it’s not related to the DSpace 5.5 upgrade + I’ve raised a ticket with Atmire to ask + Another worrying error from dspace.log is: "/> - + @@ -134,20 +136,21 @@ Another worrying error from dspace.log is: + +
  • While looking in the logs for errors, I see tons of warnings about Atmire MQM:

    2016-12-02 03:00:32,352 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail="dc.title", transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail="THUMBNAIL", transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
  • Another worrying error from dspace.log is:

    org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.NoSuchMethodError: com.atmire.statistics.generator.DSpaceObjectDatasetGenerator.toDatasetQuery(Lorg/dspace/core/Context;)Lcom/atmire/statistics/content/DatasetQuery;
    @@ -239,35 +242,35 @@ Caused by: java.lang.NoSuchMethodError: com.atmire.statistics.generator.DSpaceOb
     
  • The first error I see in dspace.log this morning is:

    2016-12-02 03:00:46,656 ERROR org.dspace.authority.AuthorityValueFinder @ anonymous::Error while retrieving AuthorityValue from solr:query\colon; id\colon;"b0b541c1-ec15-48bf-9209-6dbe8e338cdc"
     org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://localhost:8081/solr/authority
  • Looking through DSpace’s solr log I see that about 20 seconds before this, there were a few 30+ KiB solr queries
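
  • One rough way to spot those, as a sketch (the Solr log path here is just an example), is to print only the unusually long request lines from that day's log:

    $ awk 'length($0) > 30000' /home/dspacetest.cgiar.org/log/solr.log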

  • The last logs here right before Solr became unresponsive (and right after I restarted it five hours later) were:

    2016-12-02 03:00:42,606 INFO  org.apache.solr.core.SolrCore @ [statistics] webapp=/solr path=/select params={q=containerItem:72828+AND+type:0&shards=localhost:8081/solr/statistics-2010,localhost:8081/solr/statistics&fq=-isInternal:true&fq=-(author_mtdt:"CGIAR\+Institutional\+Learning\+and\+Change\+Initiative"++AND+subject_mtdt:"PARTNERSHIPS"+AND+subject_mtdt:"RESEARCH"+AND+subject_mtdt:"AGRICULTURE"+AND+subject_mtdt:"DEVELOPMENT"++AND+iso_mtdt:"en"+)&rows=0&wt=javabin&version=2} hits=0 status=0 QTime=19
     2016-12-02 08:28:23,908 INFO  org.apache.solr.servlet.SolrDispatchFilter @ SolrDispatchFilter.init()

    2016-12-04

  • It says 732 bitstreams have potential issues, for example:

    ------------------------------------------------ 
     Bitstream Id = 6
    @@ -286,14 +289,15 @@ Checksum Expected = 9959301aa4ca808d00957dff88214e38
     Checksum Calculated = 
     Result = The bitstream could not be found
     ----------------------------------------------- 
  • The first one seems ok, but I don’t know what to make of the second one…

  • I had a look and there is indeed no file with the second checksum in the assetstore (ie, looking in [dspace-dir]/assetstore/99/59/30/...)
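
  • A minimal sketch of double-checking that, assuming the default filesystem assetstore and MD5 checksums (the assetstore path is just an example):

    $ ls -l /home/dspacetest.cgiar.org/assetstore/99/59/30/
    $ find /home/dspacetest.cgiar.org/assetstore -type f -exec md5sum {} + | grep 9959301aa4ca808d00957dff88214e38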

  • For what it’s worth, there is no item on DSpace Test or S3 backups with that checksum either…

  • In other news, I’m looking at JVM settings from the Solr 4.10.2 release, from bin/solr.in.sh:

    # These GC settings have shown to work well for a number of common Solr workloads
     GC_TUNE="-XX:-UseSuperWord \
    @@ -314,11 +318,11 @@ GC_TUNE="-XX:-UseSuperWord \
     -XX:+CMSParallelRemarkEnabled \
     -XX:+ParallelRefProcEnabled \
     -XX:+AggressiveOpts"

    2016-12-05

    @@ -330,21 +334,19 @@ GC_TUNE="-XX:-UseSuperWord \
  • I did a few traceroutes from Jordan and Kenya and it seems that Linode’s Frankfurt datacenter is a few hops closer and perhaps has less packet loss than the London one, so I put the new server in Frankfurt
  • Do initial provisioning
  • Atmire responded about the MQM warnings in the DSpace logs
  • Apparently we need to change the batch edit consumers in dspace/config/dspace.cfg:

    event.consumer.batchedit.filters = Community|Collection+Create

    2016-12-06

  • Some author authority corrections and name standardizations for Peter:

    dspace=# update metadatavalue set authority='b041f2f4-19e7-4113-b774-0439baabd197', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Mora Benard%';
     UPDATE 11
    @@ -358,47 +360,55 @@ dspace=# update metadatavalue set authority='0d8369bb-57f7-4b2f-92aa-af820b183ac
     UPDATE 360
     dspace=# update metadatavalue set text_value='Grace, Delia', authority='0b4fcbc1-d930-4319-9b4d-ea1553cca70b', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
     UPDATE 561
  • Pay attention to the regex to prevent false positives in tricky cases with Dutch names!

  • I will run these updates on DSpace Test and then force a Discovery reindex, and then run them on CGSpace next week
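
  • As a reminder, a full Discovery reindex should be something like this (same DSpace path as the other commands in these notes):

    $ time /home/dspacetest.cgiar.org/bin/dspace index-discovery -b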

  • More work on the KM4Dev Journal article

  • In other news, it seems the batch edit patch is working, there are no more WARN errors in the logs and the batch edit seems to work
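
  • A sketch of one way to verify that, counting the MQM warnings per day in the DSpace log (the date suffix is just an example):

    $ grep -c "BatchEditConsumer should not have been given" /home/dspacetest.cgiar.org/log/dspace.log.2016-12-06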

  • I need to check the CGSpace logs to see if there are still errors there, and then deploy/monitor it there

  • Paola from CCAFS mentioned she also has the “take task” bug on CGSpace

  • Reading about shared_buffers in PostgreSQL configuration (default is 128MB)

  • Looks like we have ~5GB of memory used by caches on the test server (after OS and JVM heap!), so we might as well bump up the buffers for Postgres

  • The docs say a good starting point for a dedicated server is 25% of the system RAM, and our server isn’t dedicated (also runs Solr, which can benefit from OS cache) so let’s try 1024MB
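
  • A minimal sketch of checking and bumping that setting (the postgresql.conf path depends on the PostgreSQL version, and a restart is needed afterwards):

    $ psql -c 'SHOW shared_buffers;'
    # vim /etc/postgresql/<version>/main/postgresql.conf    # set shared_buffers = 1024MB
    # systemctl restart postgresql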

  • In other news, the authority reindexing keeps crashing (I was manually running it after the author updates above):

    $ time JAVA_OPTS="-Xms768m -Xmx768m -Dfile.encoding=UTF-8" /home/dspacetest.cgiar.org/bin/dspace index-authority
     Retrieving all data
     Initialize org.dspace.authority.indexer.DSpaceAuthorityIndexer
     Exception: null
     java.lang.NullPointerException
             at org.dspace.authority.AuthorityValueGenerator.generateRaw(AuthorityValueGenerator.java:82)
             at org.dspace.authority.AuthorityValueGenerator.generate(AuthorityValueGenerator.java:39)
             at org.dspace.authority.indexer.DSpaceAuthorityIndexer.prepareNextValue(DSpaceAuthorityIndexer.java:201)
             at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:132)
             at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
             at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
             at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:159)
             at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
             at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
             at org.dspace.authority.indexer.AuthorityIndexClient.main(AuthorityIndexClient.java:61)
             at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
             at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
             at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
             at java.lang.reflect.Method.invoke(Method.java:498)
             at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
             at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
     
     real    8m39.913s
     user    1m54.190s
     sys     0m22.647s

    2016-12-07

    @@ -407,108 +417,107 @@ sys 0m22.647s
  • I will have to test more
  • Anyways, I noticed that some of the authority values I set actually have versions of author names we don’t want, ie “Grace, D.”
  • For example, do a Solr query for “first_name:Grace” and look at the results
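
  • A minimal sketch of that kind of query against the authority core, using the same local Solr port as the other examples in these notes:

    $ curl 'localhost:8081/solr/authority/select?q=first_name%3AGrace&wt=json&indent=true'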
  • Querying that ID shows the fields that need to be changed:

    {
      "responseHeader": {
        "status": 0,
        "QTime": 1,
        "params": {
          "q": "id:0b4fcbc1-d930-4319-9b4d-ea1553cca70b",
          "indent": "true",
          "wt": "json",
          "_": "1481102189244"
        }
      },
      "response": {
        "numFound": 1,
        "start": 0,
        "docs": [
          {
            "id": "0b4fcbc1-d930-4319-9b4d-ea1553cca70b",
            "field": "dc_contributor_author",
            "value": "Grace, D.",
            "deleted": false,
            "creation_date": "2016-11-10T15:13:40.318Z",
            "last_modified_date": "2016-11-10T15:13:40.318Z",
            "authority_type": "person",
            "first_name": "D.",
            "last_name": "Grace"
          }
        ]
      }
    }
  • I think I can just update the value, first_name, and last_name fields…

  • + +
  • The update syntax should be something like this, but I’m getting errors from Solr:

    $ curl 'localhost:8081/solr/authority/update?commit=true&wt=json&indent=true' -H 'Content-type:application/json' -d '[{"id":"1","price":{"set":100}}]'
     {
       "responseHeader":{
         "status":400,
         "QTime":0},
       "error":{
         "msg":"Unexpected character '[' (code 91) in prolog; expected '<'\n at [row,col {unknown-source}]: [1,1]",
         "code":400}}
  • When I try using the XML format I get an error that the updateLog needs to be configured for that core

  • Maybe I can just remove the authority UUID from the records, run the indexing again so it creates a new one for each name variant, then match them correctly?

    dspace=# update metadatavalue set authority=null, confidence=-1 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
     UPDATE 561
  • Then I’ll reindex discovery and authority and see how the authority Solr core looks

  • After this, now there are authorities for some of the “Grace, D.” and “Grace, Delia” text_values in the database (the first version is actually the same authority that already exists in the core, so it was just added back to some text_values, but the second one is new):

    $ curl 'localhost:8081/solr/authority/select?q=id%3A18ea1525-2513-430a-8817-a834cd733fbc&wt=json&indent=true'
     {
       "responseHeader":{
         "status":0,
         "QTime":0,
         "params":{
           "q":"id:18ea1525-2513-430a-8817-a834cd733fbc",
           "indent":"true",
           "wt":"json"}},
       "response":{"numFound":1,"start":0,"docs":[
           {
             "id":"18ea1525-2513-430a-8817-a834cd733fbc",
             "field":"dc_contributor_author",
             "value":"Grace, Delia",
             "deleted":false,
             "creation_date":"2016-12-07T10:54:34.356Z",
             "last_modified_date":"2016-12-07T10:54:34.356Z",
             "authority_type":"person",
             "first_name":"Delia",
             "last_name":"Grace"}]
       }}
  • So now I could set them all to this ID and the name would be ok, but there has to be a better way!

  • In this case it seems that since there were also two different IDs in the original database, I just picked the wrong one!

  • Better to use:

    dspace#= update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-4175-991c-2e7315000f0c', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
  • This proves that unifying author name varieties in authorities is easy, but fixing the name in the authority is tricky!

  • Perhaps another way is to just add our own UUID to the authority field for the text_value we like, then re-index authority so they get synced from PostgreSQL to Solr, then set the other text_values to use that authority ID
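
  • A rough sketch of that approach, assuming the authority indexer will sync whatever UUID is already present in the metadata (the <new-uuid> placeholder would come from uuidgen):

    $ uuidgen | tr [A-Z] [a-z]
    $ psql -d dspace -c "update metadatavalue set authority='<new-uuid>', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value='Grace, Delia';"
    $ /home/dspacetest.cgiar.org/bin/dspace index-authority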

  • Deploy MQM WARN fix on CGSpace (#289)

  • Deploy “take task” hack/fix on CGSpace (#290)

  • I ran the following author corrections and then reindexed discovery:

    update metadatavalue set authority='b041f2f4-19e7-4113-b774-0439baabd197', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Mora Benard%';
     update metadatavalue set text_value = 'Hoek, Rein van der', authority='4d6cbce2-6fd5-4b43-9363-58d18e7952c9', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Hoek, R%';
    @@ -516,68 +525,63 @@ update metadatavalue set text_value = 'Hoek, Rein van der', authority='4d6cbce2-
     update metadatavalue set authority='18349f29-61b1-44d7-ac60-89e55546e812', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thorne, P%';
     update metadatavalue set authority='0d8369bb-57f7-4b2f-92aa-af820b183aca', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thornton, P%';
     update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-4175-991c-2e7315000f0c', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';

    2016-12-08

  • Something weird happened and Peter Thorne’s names all ended up as “Thorne”, I guess because the original authority had that as its name value:

    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thorne%';
         text_value    |              authority               | confidence
     ------------------+--------------------------------------+------------
      Thorne, P.J.     | 18349f29-61b1-44d7-ac60-89e55546e812 |        600
      Thorne           | 18349f29-61b1-44d7-ac60-89e55546e812 |        600
      Thorne-Lyman, A. | 0781e13a-1dc8-4e3f-82e8-5c422b44a344 |         -1
      Thorne, M. D.    | 54c52649-cefd-438d-893f-3bcef3702f07 |         -1
      Thorne, P.J      | 18349f29-61b1-44d7-ac60-89e55546e812 |        600
      Thorne, P.       | 18349f29-61b1-44d7-ac60-89e55546e812 |        600
     (6 rows)
  • I generated a new UUID using uuidgen | tr [A-Z] [a-z] and set it along with correct name variation for all records:

    dspace=# update metadatavalue set authority='b2f7603d-2fb5-4018-923a-c4ec8d85b3bb', text_value='Thorne, P.J.' where resource_type_id=2 and metadata_field_id=3 and authority='18349f29-61b1-44d7-ac60-89e55546e812';
     UPDATE 43
  • Apparently we also need to normalize Phil Thornton’s names to Thornton, Philip K.:

    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^Thornton[,\.]? P.*';
          text_value      |              authority               | confidence
     ---------------------+--------------------------------------+------------
      Thornton, P         | 0d8369bb-57f7-4b2f-92aa-af820b183aca |        600
      Thornton, P K.      | 0d8369bb-57f7-4b2f-92aa-af820b183aca |        600
      Thornton, P K       | 0d8369bb-57f7-4b2f-92aa-af820b183aca |        600
      Thornton. P.K.      | 3e1e6639-d4fb-449e-9fce-ce06b5b0f702 |         -1
      Thornton, P K .     | 0d8369bb-57f7-4b2f-92aa-af820b183aca |        600
      Thornton, P.K.      | 0d8369bb-57f7-4b2f-92aa-af820b183aca |        600
      Thornton, P.K       | 0d8369bb-57f7-4b2f-92aa-af820b183aca |        600
      Thornton, Philip K  | 0d8369bb-57f7-4b2f-92aa-af820b183aca |        600
      Thornton, Philip K. | 0d8369bb-57f7-4b2f-92aa-af820b183aca |        600
      Thornton, P. K.     | 0d8369bb-57f7-4b2f-92aa-af820b183aca |        600
     (10 rows)
  • Seems his original authorities are using an incorrect version of the name so I need to generate another UUID and tie it to the correct name, then reindex:

    dspace=# update metadatavalue set authority='2df8136e-d8f4-4142-b58c-562337cab764', text_value='Thornton, Philip K.', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^Thornton[,\.]? P.*';
     UPDATE 362

    2016-12-09

    @@ -585,15 +589,14 @@ UPDATE 362
  • Run the following author corrections on CGSpace:

    dspace=# update metadatavalue set authority='34df639a-42d8-4867-a3f2-1892075fcb3f', text_value='Thorne, P.J.' where resource_type_id=2 and metadata_field_id=3 and authority='18349f29-61b1-44d7-ac60-89e55546e812' or authority='021cd183-946b-42bb-964e-522ebff02993';
     dspace=# update metadatavalue set authority='2df8136e-d8f4-4142-b58c-562337cab764', text_value='Thornton, Philip K.', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^Thornton[,\.]? P.*';

    2016-12-11

    @@ -606,40 +609,38 @@ dspace=# update metadatavalue set authority='2df8136e-d8f4-4142-b58c-562337cab76 postgres_connections_ALL-week

  • Looking at CIAT records from last week again, they have a lot of double authors like:

    International Center for Tropical Agriculture::3026b1de-9302-4f3e-85ab-ef48da024eb2::600
     International Center for Tropical Agriculture::3026b1de-9302-4f3e-85ab-ef48da024eb2::500
     International Center for Tropical Agriculture::3026b1de-9302-4f3e-85ab-ef48da024eb2::0
  • Some in the same dc.contributor.author field, and some in others like dc.contributor.author[en_US] etc

  • Removing the duplicates in OpenRefine and uploading a CSV to DSpace says “no changes detected”

  • Seems like the only way to sort of clean these up would be to start in SQL:

    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like 'International Center for Tropical Agriculture';
                       text_value                   |              authority               | confidence
     -----------------------------------------------+--------------------------------------+------------
      International Center for Tropical Agriculture | cc726b78-a2f4-4ee9-af98-855c2ea31c36 |         -1
      International Center for Tropical Agriculture |                                      |        600
      International Center for Tropical Agriculture | 3026b1de-9302-4f3e-85ab-ef48da024eb2 |        500
      International Center for Tropical Agriculture | cc726b78-a2f4-4ee9-af98-855c2ea31c36 |        600
      International Center for Tropical Agriculture |                                      |         -1
      International Center for Tropical Agriculture | cc726b78-a2f4-4ee9-af98-855c2ea31c36 |        500
      International Center for Tropical Agriculture | 3026b1de-9302-4f3e-85ab-ef48da024eb2 |        600
      International Center for Tropical Agriculture | 3026b1de-9302-4f3e-85ab-ef48da024eb2 |         -1
      International Center for Tropical Agriculture | 3026b1de-9302-4f3e-85ab-ef48da024eb2 |          0
     dspace=# update metadatavalue set authority='3026b1de-9302-4f3e-85ab-ef48da024eb2', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value = 'International Center for Tropical Agriculture';
     UPDATE 1693
     dspace=# update metadatavalue set authority='3026b1de-9302-4f3e-85ab-ef48da024eb2', text_value='International Center for Tropical Agriculture', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like '%CIAT%';
     UPDATE 35

    2016-12-13

    @@ -659,31 +660,34 @@ UPDATE 35
  • Would probably be better to make custom logrotate files for them in the future
  • Clean up some unneeded log files from 2014 (they weren’t large, just don’t need them)
  • So basically, new cron jobs for logs should look something like this:
  • Find any file named *.log* that isn’t dspace.log*, isn’t already zipped, and is older than one day, and zip it:

    # find /home/dspacetest.cgiar.org/log -regextype posix-extended -iregex ".*\.log.*" ! -iregex ".*dspace\.log.*" ! -iregex ".*\.(gz|lrz|lzo|xz)" ! -newermt "Yesterday" -exec schedtool -B -e ionice -c2 -n7 xz {} \;
  • Since there is xzgrep and xzless we can actually just zip them after one day, why not?!
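
  • For example, searching a compressed log in place would be something like this (the file name is just an example):

    $ xzgrep -c ERROR /home/dspacetest.cgiar.org/log/solr.log.2016-12-12.xz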

  • We can keep the zipped ones for two weeks just in case we need to look for errors, etc, and delete them after that
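
  • A sketch of the corresponding cleanup step, reusing the same find pattern with a fourteen-day cutoff:

    # find /home/dspacetest.cgiar.org/log -regextype posix-extended -iregex ".*\.log.*\.(gz|lrz|lzo|xz)" -mtime +14 -delete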

  • I use schedtool -B and ionice -c2 -n7 to set the CPU scheduling to SCHED_BATCH and the IO to best effort which should, in theory, impact important system processes like Tomcat and PostgreSQL less

  • When the tasks are running you can see that the policies do apply:

    $ schedtool $(ps aux | grep "xz /home" | grep -v grep | awk '{print $2}') && ionice -p $(ps aux | grep "xz /home" | grep -v grep | awk '{print $2}')
     PID 17049: PRIO   0, POLICY B: SCHED_BATCH   , NICE   0, AFFINITY 0xf
     best-effort: prio 7
  • All in all this should free up a few gigs (we were at 9.3GB free when I started)

  • Next thing to look at is whether we need Tomcat’s access logs

  • I just looked and it seems that we saved 10GB by zipping these logs

  • Some users pointed out issues with the “most popular” stats on a community or collection

  • This error appears in the logs when you try to view them:

    2016-12-13 21:17:37,486 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
     org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.NoSuchMethodError: com.atmire.statistics.generator.DSpaceObjectDatasetGenerator.toDatasetQuery(Lorg/dspace/core/Context;)Lcom/atmire/statistics/content/DatasetQuery;
    @@ -735,11 +739,11 @@ Caused by: java.lang.NoSuchMethodError: com.atmire.statistics.generator.DSpaceOb
     	at com.atmire.statistics.mostpopular.JSONStatsMostPopularGenerator.generate(SourceFile:246)
     	at com.atmire.app.xmlui.aspect.statistics.JSONStatsMostPopular.generate(JSONStatsMostPopular.java:145)
     	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

    2016-12-14

    @@ -777,8 +781,8 @@ Caused by: java.lang.NoSuchMethodError: com.atmire.statistics.generator.DSpaceOb
  • Last week, when we asked CGNET to update the DNS records this weekend, they misunderstood and did it immediately
  • We quickly told them to undo it, but I just realized they didn’t undo the IPv6 AAAA record!
  • None of our users in African institutes will have IPv6, but some Europeans might, so I need to check if any submissions have been added since then
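
  • A quick sketch of checking whether the AAAA record is still being served:

    $ dig +short AAAA cgspace.cgiar.org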
  • Update some names and authorities in the database:

    dspace=# update metadatavalue set authority='5ff35043-942e-4d0a-b377-4daed6e3c1a3', confidence=600, text_value='Duncan, Alan' where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^.*Duncan,? A.*';
     UPDATE 204
    @@ -786,15 +790,17 @@ dspace=# update metadatavalue set authority='46804b53-ea30-4a85-9ccf-b79a35816fa
     UPDATE 89
     dspace=# update metadatavalue set authority='f840da02-26e7-4a74-b7ba-3e2b723f3684', confidence=600, text_value='Lukuyu, Ben A.' where resource_type_id=2 and metadata_field_id=3 and text_value like '%Lukuyu, B%';
     UPDATE 140
  • Generated a new UUID for Ben using uuidgen | tr [A-Z] [a-z] as the one in Solr had his ORCID but the name format was incorrect

  • In theory DSpace should be able to check names from ORCID and update the records in the database, but I find that this doesn’t work (see Jira bug DS-3302)

  • I need to run these updates along with the other one for CIAT that I found last week

  • Enable OCSP stapling for hosts >= Ubuntu 16.04 in our Ansible playbooks (#76)

  • Working for DSpace Test on the second response:

    $ openssl s_client -connect dspacetest.cgiar.org:443 -servername dspacetest.cgiar.org -tls1_2 -tlsextdebug -status
     ...
    @@ -803,21 +809,18 @@ $ openssl s_client -connect dspacetest.cgiar.org:443 -servername dspacetest.cgia
     ...
     OCSP Response Data:
     ...
     Cert Status: good
  • Migrate CGSpace to new server, roughly following these steps:

  • On old server:

    # service tomcat7 stop
     # /home/backup/scripts/postgres_backup.sh
  • On new server:

    # systemctl stop tomcat7
     # rsync -4 -av --delete 178.79.187.182:/home/cgspace.cgiar.org/assetstore/ /home/cgspace.cgiar.org/assetstore/
    @@ -843,10 +846,9 @@ $ cd src/git/DSpace/dspace/target/dspace-installer
     $ ant update clean_backups
     $ exit
     # systemctl start tomcat7
  • For example, this shows 186 mappings for the item, the first three of which are real:

    dspace=#  select * from collection2item where item_id = '80596';
  • Then I deleted the others:

    dspace=# delete from collection2item where item_id = '80596' and id not in (90792, 90806, 90807);

    2017-01-11

  • Error in fix-metadata-values.py when it tries to print the value for Entwicklung & Ländlicher Raum:

    Traceback (most recent call last):
       File "./fix-metadata-values.py", line 80, in <module>
         print("Fixing {} occurences of: {}".format(records_to_fix, record[0]))
     UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 15: ordinal not in range(128)
  • Seems we need to encode as UTF-8 before printing to screen, ie:

    print("Fixing {} occurences of: {}".format(records_to_fix, record[0].encode('utf-8')))
  • See: http://stackoverflow.com/a/36427358/487333

  • I’m actually not sure if we need to encode() the strings to UTF-8 before writing them to the database… I’ve never had this issue before

  • Now back to cleaning up some journal titles so we can make the controlled vocabulary:

    $ ./fix-metadata-values.py -i /tmp/fix-27-journal-titles.csv -f dc.source -t correct -m 55 -d dspace -u dspace -p 'fuuu'
  • Now get the top 500 journal titles:

    dspace-# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=55 group by text_value order by count desc limit 500) to /tmp/journal-titles.csv with csv;

    2017-01-13

    @@ -287,14 +278,14 @@ UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 15:

    2017-01-16

  • Fix the two items Maria found with duplicate mappings with this script:

    /* 184 in correct mappings: https://cgspace.cgiar.org/handle/10568/78596 */
     delete from collection2item where item_id = '80596' and id not in (90792, 90806, 90807);
     /* 1 incorrect mapping: https://cgspace.cgiar.org/handle/10568/78658 */
     delete from collection2item where id = '91082';

    2017-01-17

    @@ -303,48 +294,43 @@ delete from collection2item where id = '91082';
  • There are about 30 files with %20 (space) and Spanish accents in the file name
  • At first I thought we should fix these, but actually it is prescribed by the W3 working group to convert these to UTF8 and URL encode them!
  • And the file names don’t really matter either, as long as the SAF Builder tool can read them—after that DSpace renames them with a hash in the assetstore
  • Seems like the only ones I should replace are the ' apostrophe characters, as %27:

    value.replace("'",'%27')
  • Add the item’s Type to the filename column as a hint to SAF Builder so it can set a more useful description field:

    value + "__description:" + cells["dc.type"].value
  • Test importing of the new CIAT records (actually there are 232, not 234):

    $ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" /home/dspacetest.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/79042 --source /home/aorth/CIAT_234/SimpleArchiveFormat/ --mapfile=/tmp/ciat.map &> /tmp/ciat.log
  • Many of the PDFs are 20, 30, 40, 50+ MB, which makes a total of 4GB

  • These are scanned from paper and likely have no compression, so we should try to test if these compression techniques help without compromising the quality too much:

    $ convert -compress Zip -density 150x150 input.pdf output.pdf
     $ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf

    2017-01-19

  • Import 232 CIAT records into CGSpace:

    $ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/68704 --source /home/aorth/CIAT_232/SimpleArchiveFormat/ --mapfile=/tmp/ciat.map &> /tmp/ciat.log

    2017-01-22

    @@ -357,40 +343,37 @@ $ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE - + +
  • Move some old ILRI Program communities to a new subcommunity for former programs (10568/79164):

    $ for community in 10568/171 10568/27868 10568/231 10568/27869 10568/150 10568/230 10568/32724 10568/172; do /home/cgspace.cgiar.org/bin/dspace community-filiator --remove --parent=10568/27866 --child="$community" && /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/79164 --child="$community"; done
  • Move some collections with move-collections.sh using the following config:

    10568/42161 10568/171 10568/79341
     10568/41914 10568/171 10568/79340

    2017-01-24

  • Run fixes for Journal titles on CGSpace:

    $ ./fix-metadata-values.py -i /tmp/fix-49-journal-titles.csv -f dc.source -t correct -m 55 -d dspace -u dspace -p 'password'
  • Create a new list of the top 500 journal titles from the database:

    dspace-# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=55 group by text_value order by count desc limit 500) to /tmp/journal-titles.csv with csv;

    2017-01-25

    diff --git a/docs/2017-02/index.html b/docs/2017-02/index.html index 8712981d1..187b724e4 100644 --- a/docs/2017-02/index.html +++ b/docs/2017-02/index.html @@ -11,20 +11,19 @@ An item was mapped twice erroneously again, so I had to remove one of the mappings manually: - dspace=# select * from collection2item where item_id = '80278'; - id | collection_id | item_id +id | collection_id | item_id -------+---------------+--------- - 92551 | 313 | 80278 - 92550 | 313 | 80278 - 90774 | 1051 | 80278 +92551 | 313 | 80278 +92550 | 313 | 80278 +90774 | 1051 | 80278 (3 rows) dspace=# delete from collection2item where id = 92551 and item_id = 80278; DELETE 1 - Create issue on GitHub to track the addition of CCAFS Phase II project tags (#301) + Looks like we’ll be using cg.identifier.ccafsprojectpii as the field name " /> @@ -39,23 +38,22 @@ Looks like we’ll be using cg.identifier.ccafsprojectpii as the field name An item was mapped twice erroneously again, so I had to remove one of the mappings manually: - dspace=# select * from collection2item where item_id = '80278'; - id | collection_id | item_id +id | collection_id | item_id -------+---------------+--------- - 92551 | 313 | 80278 - 92550 | 313 | 80278 - 90774 | 1051 | 80278 +92551 | 313 | 80278 +92550 | 313 | 80278 +90774 | 1051 | 80278 (3 rows) dspace=# delete from collection2item where id = 92551 and item_id = 80278; DELETE 1 - Create issue on GitHub to track the addition of CCAFS Phase II project tags (#301) + Looks like we’ll be using cg.identifier.ccafsprojectpii as the field name "/> - + @@ -137,23 +135,22 @@ Looks like we’ll be using cg.identifier.ccafsprojectpii as the field name

    2017-02-07

  • An item was mapped twice erroneously again, so I had to remove one of the mappings manually:

    dspace=# select * from collection2item where item_id = '80278';
       id   | collection_id | item_id
     -------+---------------+---------
      92551 |           313 |   80278
      92550 |           313 |   80278
      90774 |          1051 |   80278
     (3 rows)
     dspace=# delete from collection2item where id = 92551 and item_id = 80278;
     DELETE 1

    2017-02-08

    @@ -168,11 +165,12 @@ DELETE 1
  • POLICIES AND INSTITUTIONS → PRIORITIES AND POLICIES FOR CSA
  • The climate risk management one doesn’t exist, so I will have to ask Magdalena if they want me to add it to the input forms
  • Start testing the nearly 500 author corrections that CCAFS sent me:

    $ ./fix-metadata-values.py -i /tmp/CCAFS-Authors-Feb-7.csv -f dc.contributor.author -t 'correct name' -m 3 -d dspace -u dspace -p fuuu

    2017-02-09

    @@ -181,11 +179,12 @@ DELETE 1
  • Looks like simply adding a new metadata field to dspace/config/registries/cgiar-types.xml and restarting DSpace causes the field to get added to the registry
  • It requires a restart but at least it allows you to manage the registry programmatically
  • It’s not a very good way to manage the registry, though, as removing one there doesn’t cause it to be removed from the registry, and we always restore from database backups so there would never be a scenario when we needed these to be created
  • Testing some corrections on CCAFS Phase II flagships (cg.subject.ccafs):

    $ ./fix-metadata-values.py -i ccafs-flagships-feb7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p fuuu

    2017-02-10

    @@ -235,74 +234,57 @@ DELETE 1
  • Experiment with making DSpace generate HTTPS handle links, first a change in dspace.cfg or the site’s properties file:

    handle.canonical.prefix = https://hdl.handle.net/
  • And then a SQL command to update existing records:

    dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://hdl.handle.net', 'https://hdl.handle.net') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'uri');
     UPDATE 58193
  • Seems to work fine!

  • I noticed a few items that have incorrect DOI links (dc.identifier.doi), and after looking in the database I see there are over 100 that are missing the scheme or are just plain wrong:

    dspace=# select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value not like 'http%://%';
  • This will replace any that begin with 10. and change them to https://dx.doi.org/10.:

    dspace=# update metadatavalue set text_value = regexp_replace(text_value, '(^10\..+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like '10.%';
  • This will get any that begin with doi:10. and change them to https://dx.doi.org/10.x:

    dspace=# update metadatavalue set text_value = regexp_replace(text_value, '^doi:(10\..+$)', 'https://dx.doi.org/\1') where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'doi:10%';
  • Fix DOIs like dx.doi.org/10. to be https://dx.doi.org/10.:

    dspace=# update metadatavalue set text_value = regexp_replace(text_value, '(^dx.doi.org/.+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'dx.doi.org/%';
  • Fix DOIs like http//:

    dspace=# update metadatavalue set text_value = regexp_replace(text_value, '^http//(dx.doi.org/.+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'http//%';
  • Fix DOIs like dx.doi.org./:

    dspace=# update metadatavalue set text_value = regexp_replace(text_value, '(^dx.doi.org\./.+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'dx.doi.org./%'
     
  • Delete some invalid DOIs:

    dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value in ('DOI','CPWF Mekong','Bulawayo, Zimbabwe','bb');
  • Fix some other random outliers:

    dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.1016/j.aquaculture.2015.09.003' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'http:/dx.doi.org/10.1016/j.aquaculture.2015.09.003';
     dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.5337/2016.200' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'doi: https://dx.doi.org/10.5337/2016.200';
    @@ -310,23 +292,22 @@ dspace=# update metadatavalue set text_value = 'https://dx.doi.org/doi:10.1371/j
     dspace=# update metadatavalue set text_value = 'https://dx.doi.10.1016/j.cosust.2013.11.012' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'http:dx.doi.10.1016/j.cosust.2013.11.012';
     dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.1080/03632415.2014.883570' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'org/10.1080/03632415.2014.883570';
     dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.15446/agron.colomb.v32n3.46052' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'Doi: 10.15446/agron.colomb.v32n3.46052';
  • And do another round of http:// → https:// cleanups:

    dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://dx.doi.org', 'https://dx.doi.org') where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'http://dx.doi.org%';
  • Run all DOI corrections on CGSpace

  • Something to think about here is to write a Curation Task in Java to do these sanity checks / corrections every night

  • Then we could add a cron job for them and run them from the command line like:

    [dspace]/bin/dspace curate -t noop -i 10568/79891
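
  • A nightly crontab entry for that could be something like this sketch (the task name and the site handle here are hypothetical placeholders):

    0 3 * * * [dspace]/bin/dspace curate -t checkdois -i 10568/0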

    2017-02-20

    @@ -337,8 +318,8 @@ dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.15446/agro
  • Help Sisay with SQL commands
  • Help Paola from CCAFS with the Atmire Listings and Reports module
  • Testing the fix-metadata-values.py script on macOS and it seems like we don’t need to use .encode('utf-8') anymore when printing strings to the screen
  • It seems this might have only been a temporary problem, as both Python 3.5.2 and 3.6.0 are able to print the problematic string “Entwicklung & Ländlicher Raum” without the encode() call, but print it as a bytes when it is used:

    $ python
     Python 3.6.0 (default, Dec 25 2016, 17:30:53)
    @@ -346,37 +327,34 @@ Python 3.6.0 (default, Dec 25 2016, 17:30:53)
     Entwicklung & Ländlicher Raum
     >>> print('Entwicklung & Ländlicher Raum'.encode())
     b'Entwicklung & L\xc3\xa4ndlicher Raum'

    2017-02-21

  • It seems there is a bug in filter-media that causes it to process formats that aren’t part of its configuration:

    $ [dspace]/bin/dspace filter-media -f -i 10568/16856 -p "ImageMagick PDF Thumbnail"
     File: earlywinproposal_esa_postharvest.pdf.jpg
     FILTERED: bitstream 13787 (item: 10568/16881) and created 'earlywinproposal_esa_postharvest.pdf.jpg'
     File: postHarvest.jpg.jpg
     FILTERED: bitstream 16524 (item: 10568/24655) and created 'postHarvest.jpg.jpg'
  • According to dspace.cfg the ImageMagick PDF Thumbnail plugin should only process PDFs:

    filter.org.dspace.app.mediafilter.ImageMagickImageThumbnailFilter.inputFormats = BMP, GIF, image/png, JPG, TIFF, JPEG, JPEG 2000
     filter.org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter.inputFormats = Adobe PDF

    2017-02-22

    @@ -389,24 +367,22 @@ filter.org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter.inputFormats = A

    2017-02-26

  • Find all fields with “http://hdl.handle.net” values (most are in dc.identifier.uri, but some are in other URL-related fields like cg.link.reference, cg.identifier.dataurl, and cg.identifier.url):

    dspace=# select distinct metadata_field_id from metadatavalue where resource_type_id=2 and text_value like 'http://hdl.handle.net%';
     dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://hdl.handle.net', 'https://hdl.handle.net') where resource_type_id=2 and metadata_field_id IN (25, 113, 179, 219, 220, 223) and text_value like 'http://hdl.handle.net%';
     UPDATE 58633

    2017-02-27

  • LDAP users cannot log in today, looks to be an issue with CGIAR’s LDAP server:

    $ openssl s_client -connect svcgroot2.cgiarad.org:3269
     CONNECTED(00000003)
    @@ -418,15 +394,14 @@ verify error:num=21:unable to verify the first certificate
     verify return:1
     ---
     Certificate chain
    - 0 s:/CN=SVCGROOT2.CGIARAD.ORG
    -   i:/CN=CGIARAD-RDWA-CA
    +0 s:/CN=SVCGROOT2.CGIARAD.ORG
    +i:/CN=CGIARAD-RDWA-CA
     ---
  • For some reason it is now signed by a private certificate authority

  • This error seems to have started on 2017-02-25:

    $ grep -c "unable to find valid certification path" [dspace]/log/dspace.log.2017-02-*
     [dspace]/log/dspace.log.2017-02-01:0
    @@ -456,24 +431,28 @@ Certificate chain
     [dspace]/log/dspace.log.2017-02-25:7
     [dspace]/log/dspace.log.2017-02-26:8
     [dspace]/log/dspace.log.2017-02-27:90
  • Also, it seems that we need to use a different user for LDAP binds, as we’re still using the temporary one from the root migration, so maybe we can go back to the previous user we were using

  • So it looks like the certificate is invalid AND the bind users we had been using were deleted

  • Biruk Debebe recreated the bind user and now we are just waiting for CGNET to update their certificates

  • Regarding the filter-media issue I found earlier, it seems that the ImageMagick PDF plugin will also process JPGs if they are in the “Content Files” (aka ORIGINAL) bundle

  • The problem likely lies in the logic of ImageMagickThumbnailFilter.java, as ImageMagickPdfThumbnailFilter.java extends it

  • Run CIAT corrections on CGSpace

    dspace=# update metadatavalue set authority='3026b1de-9302-4f3e-85ab-ef48da024eb2', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value = 'International Center for Tropical Agriculture';

    2017-02-28

@@ -481,26 +460,23 @@ Certificate chain
  • I think I can do it by first exporting all metadatavalues that have the author International Center for Tropical Agriculture

    dspace=# \copy (select resource_id, metadata_value_id from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='International Center for Tropical Agriculture') to /tmp/ciat.csv with csv;
     COPY 1968
  • And then use awk to print the duplicate lines to a separate file:

    $ awk -F',' 'seen[$1]++' /tmp/ciat.csv > /tmp/ciat-dupes.csv
  • From that file I can create a list of 279 deletes and put them in a batch script like:

    delete from metadatavalue where resource_type_id=2 and metadata_field_id=3 and metadata_value_id=2742061;
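• One way to generate that batch script from the dupes CSV is with awk, since the second column is the metadata_value_id (the file names here are just examples):

    $ awk -F',' '{print "delete from metadatavalue where resource_type_id=2 and metadata_field_id=3 and metadata_value_id=" $2 ";"}' /tmp/ciat-dupes.csv > /tmp/ciat-deletes.sql
    $ psql dspace < /tmp/ciat-deletes.sql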
diff --git a/docs/2017-03/index.html b/docs/2017-03/index.html
index 1ce429183..e84628e32 100644
--- a/docs/2017-03/index.html
+++ b/docs/2017-03/index.html
@@ -155,12 +157,13 @@ $ identify ~/Desktop/alc_contrastes_desafios.jpg
  • Also, need to consider talking to Atmire about hiring them to bring ORCiD metadata to REST / OAI
  • Filed an issue on DSpace issue tracker for the filter-media bug that causes it to process JPGs even when limiting to the PDF thumbnail plugin: DS-3516
  • Discovered that the ImageMagick filter-media plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK
  • Interestingly, it seems DSpace 4.x’s thumbnails were sRGB, but forcing regeneration using DSpace 5.x’s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 10568/51999):

    $ identify ~/Desktop/alc_contrastes_desafios.jpg
     /Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000

    2017-03-04

@@ -212,60 +219,57 @@ DirectClass sRGB Alpha
  • We can only return specific results for metadata fields, like:

    $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "LAND REFORM", "language": null}' | json_pp
  • But there are hundreds of combinations of fields and values (like dc.subject and all the center subjects), and we can’t use wildcards in REST!

  • Reading about enabling multiple handle prefixes in DSpace

  • There is a mailing list thread from 2011 about it: http://dspace.2283337.n4.nabble.com/Multiple-handle-prefixes-merged-DSpace-instances-td3427192.html

  • And a comment from Atmire’s Bram about it on the DSpace wiki: https://wiki.duraspace.org/display/DSDOC5x/Installing+DSpace?focusedCommentId=78163296#comment-78163296

  • Bram mentions an undocumented configuration option handle.plugin.checknameauthority, but I noticed another one in dspace.cfg:

    # List any additional prefixes that need to be managed by this handle server
     # (as for examle handle prefix coming from old dspace repository merged in
     # that repository)
     # handle.additional.prefixes = prefix1[, prefix2]
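• So if we ever need our handle server to also answer for the CGIAR Library’s 10947 prefix, the setting would presumably look something like this (untested):

    handle.additional.prefixes = 10947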
  • Because of this I noticed that our Handle server’s config.dct was potentially misconfigured!

  • We had some default values still present:

    "300:0.NA/YOUR_NAMING_AUTHORITY"
  • I’ve changed them to the following and restarted the handle server:

    "300:0.NA/10568"
  • In looking at all the configs I just noticed that we are not providing a DOI in the Google-specific metadata crosswalk

  • From dspace/config/crosswalks/google-metadata.properties:

    google.citation_doi = cg.identifier.doi
  • This works, and makes DSpace output the following metadata on the item view page:

    <meta content="https://dx.doi.org/10.1186/s13059-017-1153-y" name="citation_doi">

    2017-03-06

    @@ -302,35 +306,34 @@ DirectClass sRGB Alpha

    2017-03-09

  • Export list of sponsors so Peter can clean it up:

    dspace=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'description' and qualifier = 'sponsorship') group by text_value order by count desc) to /tmp/sponsorship.csv with csv;
     COPY 285

    2017-03-12

  • Test the sponsorship fixes and deletes from Peter:

    $ ./fix-metadata-values.py -i Investors-Fix-51.csv -f dc.description.sponsorship -t Action -m 29 -d dspace -u dspace -p fuuuu
     $ ./delete-metadata-values.py -i Investors-Delete-121.csv -f dc.description.sponsorship -m 29 -d dspace -u dspace -p fuuu
  • Generate a new list of unique sponsors so we can update the controlled vocabulary:

    dspace=# \copy (select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'description' and qualifier = 'sponsorship')) to /tmp/sponsorship.csv with csv;

    Livestock CRP theme

    @@ -374,40 +377,36 @@ $ ./delete-metadata-values.py -i Investors-Delete-121.csv -f dc.description.spon

    2017-03-28

  • CCAFS said they are ready for the flagship updates for Phase II to be run (cg.subject.ccafs), so I ran them on CGSpace:

    $ ./fix-metadata-values.py -i ccafs-flagships-feb7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p fuuu
  • We’ve been waiting since February to run these

  • Also, I generated a list of all CCAFS flagships because there are a dozen or so more than there should be:

    dspace=# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=210 group by text_value order by count desc) to /tmp/ccafs.csv with csv;

    2017-03-29

  • Dump a list of fields in the DC and CG schemas to compare with CG Core:

    dspace=# select case when metadata_schema_id=1 then 'dc' else 'cg' end as schema, element, qualifier, scope_note from metadatafieldregistry where metadata_schema_id in (1, 2);
  • Ooh, a better one!

    dspace=# select coalesce(case when metadata_schema_id=1 then 'dc.' else 'cg.' end) || concat_ws('.', element, qualifier) as field, scope_note from metadatafieldregistry where metadata_schema_id in (1, 2);

    2017-03-30

diff --git a/docs/2017-04/index.html b/docs/2017-04/index.html
index c4342b636..f937c4bf9 100644
--- a/docs/2017-04/index.html
+++ b/docs/2017-04/index.html
@@ -17,10 +17,11 @@ Quick proof-of-concept hack to add dc.rights to the input form, including some i
 Remove redundant/duplicate text in the DSpace submission license
@@ -135,95 +137,88 @@ $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Th
  • Testing the CMYK patch on a collection with 650 items:

    $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt

    2017-04-03

  • Continue testing the CMYK patch on more communities:

    $ [dspace]/bin/dspace filter-media -f -i 10568/1 -p "ImageMagick PDF Thumbnail" -v >> /tmp/filter-media-cmyk.txt 2>&1
  • So far there are almost 500:

    $ grep -c profile /tmp/filter-media-cmyk.txt
     484
  • Also, I’m noticing some weird outliers in cg.coverage.region, need to remember to go correct these later:

    dspace=# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=227;

    2017-04-04

  • The filter-media script has been running on more large communities and now there are many more CMYK PDFs that have been fixed:

    $ grep -c profile /tmp/filter-media-cmyk.txt
     1584
  • Trying to find a way to get the number of items submitted by a certain user in 2016

  • It’s not possible in the DSpace search / module interfaces, but might be able to be derived from dc.description.provenance, as that field contains the name and email of the submitter/approver, ie:

    Submitted by Francesca Giampieri (fgiampieri) on 2016-01-19T13:56:43Z^M
     No. of bitstreams: 1^M
     ILAC_Brief21_PMCA.pdf: 113462 bytes, checksum: 249fef468f401c066a119f5db687add0 (MD5)
  • This SQL query returns fields that were submitted or approved by giampieri in 2016 and contain a “checksum” (ie, there was a bitstream in the submission):

    dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^(Submitted|Approved).*giampieri.*2016-.*checksum.*';
  • Then this one does the same, but for fields that don’t contain checksums (ie, there was no bitstream in the submission):

    dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^(Submitted|Approved).*giampieri.*2016-.*' and text_value !~ '^(Submitted|Approved).*giampieri.*2016-.*checksum.*';
  • For some reason there seem to be way too many fields, for example there are 498 + 13 here, which is 511 items for just this one user.

  • It looks like there can be a scenario where the user submitted AND approved it, so some records might be doubled…

  • In that case it might just be better to see how many the user submitted (both with and without bitstreams):

    dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^Submitted.*giampieri.*2016-.*';
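• To get an actual number instead of a list of rows, the same query can be wrapped in a count (a quick sketch, not something I ran at the time):

    dspace=# select count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^Submitted.*giampieri.*2016-.*';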

    2017-04-05

  • After doing a few more large communities it seems this is the final count of CMYK PDFs:

    $ grep -c profile /tmp/filter-media-cmyk.txt
     2505

    2017-04-06

    @@ -301,8 +296,8 @@ ILAC_Brief21_PMCA.pdf: 113462 bytes, checksum: 249fef468f401c066a119f5db687add0
  • I don’t see these fields anywhere in our source code or the database’s metadata registry, so maybe it’s just a cache issue
  • I will have to check the OAI cron scripts on DSpace Test, and then run them on CGSpace
  • Running dspace oai import and dspace oai clean-cache have zero effect, but this seems to rebuild the cache from scratch:

    $ /home/dspacetest.cgiar.org/bin/dspace oai import -c
     ...
    @@ -311,14 +306,15 @@ ILAC_Brief21_PMCA.pdf: 113462 bytes, checksum: 249fef468f401c066a119f5db687add0
     Total: 64056 items
     Purging cached OAI responses.
     OAI 2.0 manager action ended. It took 829 seconds.
  • After reading some threads on the DSpace mailing list, I see that clean-cache is actually only for caching responses, ie to client requests in the OAI web application

  • These are stored in [dspace]/var/oai/requests/

  • The import command should theoretically catch situations like this where an item’s metadata was updated, but in this case we changed the metadata schema and it doesn’t seem to catch it (could be a bug!)

  • Attempting a full rebuild of OAI on CGSpace:

    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
     $ time schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace oai import -c
    @@ -331,12 +327,13 @@ OAI 2.0 manager action ended. It took 1032 seconds.
     real    17m20.156s
     user    4m35.293s
     sys     1m29.310s

    2017-04-13

    @@ -381,19 +378,20 @@ sys 1m29.310s
  • CIFOR has now implemented a new “cgiar” context in their OAI that exposes CG fields, so I am re-harvesting that to see how it looks in the Discovery sidebars and searches
  • See: https://data.cifor.org/dspace/oai/cgiar?verb=GetRecord&metadataPrefix=dim&identifier=oai:data.cifor.org:11463/947
  • One thing we need to remember if we start using OAI is to enable the autostart of the harvester process (see harvester.autoStart in dspace/config/modules/oai.cfg)
  • Error when running DSpace cleanup task on DSpace Test and CGSpace (on the same item), I need to look this up:

    Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
  Detail: Key (bitstream_id)=(435) is still referenced from table "bundle".
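• Judging by the constraint name, a query like this should show which bundle is still pointing at that bitstream before deciding how to fix it (a sketch, not something I ran):

    dspace=# select bundle_id from bundle where primary_bitstream_id=435;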

    2017-04-18

  • Setup and run with:

    $ git clone https://github.com/ilri/ckm-cgspace-rest-api.git
     $ cd ckm-cgspace-rest-api/app
    @@ -401,22 +399,20 @@ $ gem install bundler
     $ bundle
     $ cd ..
     $ rails -s
  • I used Ansible to create a PostgreSQL user that only has SELECT privileges on the tables it needs:

    $ ansible linode02 -u aorth -b --become-user=postgres -K -m postgresql_user -a 'db=database name=username password=password priv=CONNECT/item:SELECT/metadatavalue:SELECT/metadatafieldregistry:SELECT/metadataschemaregistry:SELECT/collection:SELECT/handle:SELECT/bundle2bitstream:SELECT/bitstream:SELECT/bundle:SELECT/item2bundle:SELECT state=present
  • Need to look into running this via systemd

  • This is interesting for creating runnable commands from bundle:

    $ bundle binstubs puma --path ./sbin
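• A minimal systemd service for that puma binstub might look roughly like this (the user, paths, and port are hypothetical, not what was actually deployed):

    [Unit]
    Description=CKM CGSpace REST API (puma)
    After=network.target

    [Service]
    User=ckm
    WorkingDirectory=/opt/ckm-cgspace-rest-api
    ExecStart=/opt/ckm-cgspace-rest-api/sbin/puma -p 3000
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target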

    2017-04-19

    @@ -429,30 +425,27 @@ $ rails -s
  • Abenet noticed that the “Workflow Statistics” option is missing now, but we have screenshots from a presentation in 2016 when it was there
  • I filed a ticket with Atmire
  • Looking at 933 CIAT records from Sisay, he’s having problems creating a SAF bundle to import to DSpace Test
  • I started by looking at his CSV in OpenRefine, and I saw a bunch of fields with whitespace issues that I cleaned up:

    value.replace(" ||","||").replace("|| ","||").replace(" || ","||")
  • Also, all the filenames have spaces and URL encoded characters in them, so I decoded them from URL encoding:

    unescape(value,"url")
  • Then create the filename column using the following transform from URL:

    value.split('/')[-1].replace(/#.*$/,"")

    2017-04-20

    @@ -461,99 +454,97 @@ $ rails -s
  • Atmire responded about the Workflow Statistics, saying that it had been disabled because many environments needed customization to be useful
  • I re-enabled it with a hidden config key workflow.stats.enabled = true on DSpace Test and will evaluate adding it on CGSpace
  • Looking at the CIAT data again, a bunch of items have metadata values ending in ||, which might cause blank fields to be added at import time
  • Cleaning them up with OpenRefine:

    value.replace(/\|\|$/,"")

    Flagging and filtering duplicates in OpenRefine

  • Unbelievable, there are also metadata values like:

    COLLETOTRICHUM LINDEMUTHIANUM||                  FUSARIUM||GERMPLASM
  • Add a description to the file names using:

    value + "__description:" + cells["dc.type"].value
  • Test import of 933 records:

    $ [dspace]/bin/dspace import -a -e aorth@mjanja.ch -c 10568/87193 -s /home/aorth/src/CIAT-Books/SimpleArchiveFormat/ -m /tmp/ciat
     $ wc -l /tmp/ciat
     933 /tmp/ciat
  • Run system updates on CGSpace and reboot server

  • This includes switching nginx to using upstream with keepalive instead of direct proxy_pass

  • Re-deploy CGSpace to latest 5_x-prod, including the PABRA and RTB XMLUI themes, as well as the PDF processing and CMYK changes

  • More work on Ansible infrastructure stuff for Tsega’s CKM DSpace REST API

  • I’m going to start re-processing all the PDF thumbnails on CGSpace, one community at a time:

    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
     $ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace filter-media -f -v -i 10568/71249 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt
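• A rough way to queue up several communities in one go would be a loop like this, using a couple of the community handles mentioned above as examples:

    $ for community in 10568/1 10568/27629; do [dspace]/bin/dspace filter-media -f -v -i "$community" -p "ImageMagick PDF Thumbnail" >> /tmp/filter-media-cmyk.txt 2>&1; done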

    2017-04-22

  • After doing that and running the cleanup task again I find more bitstreams that are affected and end up with a long list of IDs that need to be fixed:

    dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (435, 1136, 1132, 1220, 1236, 3002, 3255, 5322);
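• One way to collect that list of IDs semi-automatically is to grep them out of the captured cleanup output (the log file name here is hypothetical):

    $ grep -oE 'Key \(bitstream_id\)=\([0-9]+\)' /tmp/cleanup.log | grep -oE '[0-9]+' | sort -un | paste -sd, -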

    2017-04-24

  • I looked at the logs from yesterday and it seems the Discovery indexing has been crashing:

    2017-04-24 00:00:15,578 INFO  com.atmire.dspace.discovery.AtmireSolrService @ Processing (55 of 58853): 70590
     2017-04-24 00:00:15,586 INFO  com.atmire.dspace.discovery.AtmireSolrService @ Processing (56 of 58853): 74507
     2017-04-24 00:00:15,614 ERROR com.atmire.dspace.discovery.AtmireSolrService @ this IndexWriter is closed
     org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: this IndexWriter is closed
    -        at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
    -        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
    -        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
    -        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
    -        at org.apache.solr.client.solrj.SolrServer.deleteByQuery(SolrServer.java:285)
    -        at org.apache.solr.client.solrj.SolrServer.deleteByQuery(SolrServer.java:271)
    -        at org.dspace.discovery.SolrServiceImpl.unIndexContent(SolrServiceImpl.java:331)
    -        at org.dspace.discovery.SolrServiceImpl.unIndexContent(SolrServiceImpl.java:315)
    -        at com.atmire.dspace.discovery.AtmireSolrService.indexContent(AtmireSolrService.java:803)
    -        at com.atmire.dspace.discovery.AtmireSolrService.updateIndex(AtmireSolrService.java:876)
    -        at org.dspace.discovery.IndexClient.main(IndexClient.java:127)
    -        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    -        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    -        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    -        at java.lang.reflect.Method.invoke(Method.java:498)
    -        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
    -        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
  • Looking at the past few days of logs, it looks like the indexing process started crashing on 2017-04-20:

    # grep -c 'IndexWriter is closed' [dspace]/log/dspace.log.2017-04-*
     [dspace]/log/dspace.log.2017-04-01:0
    @@ -580,41 +571,35 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: this Index
     [dspace]/log/dspace.log.2017-04-22:13278
     [dspace]/log/dspace.log.2017-04-23:22720
     [dspace]/log/dspace.log.2017-04-24:21422
  • I restarted Tomcat and re-ran the discovery process manually:

    [dspace]/bin/dspace index-discovery
  • Now everything is ok

  • Finally finished manually running the cleanup task over and over and null’ing the conflicting IDs:

    dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (435, 1132, 1136, 1220, 1236, 3002, 3255, 5322, 5098, 5982, 5897, 6245, 6184, 4927, 6070, 4925, 6888, 7368, 7136, 7294, 7698, 7864, 10799, 10839, 11765, 13241, 13634, 13642, 14127, 14146, 15582, 16116, 16254, 17136, 17486, 17824, 18098, 22091, 22149, 22206, 22449, 22548, 22559, 22454, 22253, 22553, 22897, 22941, 30262, 33657, 39796, 46943, 56561, 58237, 58739, 58734, 62020, 62535, 64149, 64672, 66988, 66919, 76005, 79780, 78545, 81078, 83620, 84492, 92513, 93915);

    2017-04-25

  • Preparing to run the cleanup task on CGSpace, I want to see how many files are in the assetstore:

    # find [dspace]/assetstore/ -type f | wc -l
     113104
  • Troubleshooting the Atmire Solr update process that runs at 3:00 AM every morning, after finishing at 100% it has this error:

    [=================================================> ]99% time remaining: 0 seconds. timestamp: 2017-04-25 09:07:12
     [=================================================> ]99% time remaining: 0 seconds. timestamp: 2017-04-25 09:07:12
    @@ -666,33 +651,36 @@ Caused by: java.lang.ClassNotFoundException: org.dspace.statistics.content.DSpac
     	at java.lang.Class.forName(Class.java:264)
     	at com.atmire.statistics.statlet.XmlParser.parsedatasetGenerator(SourceFile:299)
     	at com.atmire.statistics.display.StatisticsGraph.parseDatasetGenerators(SourceFile:250)

    2017-04-26

  • Update RVM’s Ruby from 2.3.0 to 2.4.0 on DSpace Test:

    $ gpg --keyserver hkp://keys.gnupg.net --recv-keys 409B6B1796C275462A1703113804BB82D39DC0E3
     $ \curl -sSL https://raw.githubusercontent.com/wayneeseguin/rvm/master/binscripts/rvm-installer | bash -s stable --ruby
     ... reload shell to get new Ruby
     $ gem install sass -v 3.3.14
     $ gem install compass -v 1.0.3
    -
diff --git a/docs/2017-05/index.html b/docs/2017-05/index.html
index 1d11d2f01..f8266b1e9 100644
--- a/docs/2017-05/index.html
+++ b/docs/2017-05/index.html
@@ -128,14 +128,13 @@
  • Need to perhaps try using the “required metadata” curation task to find items that are missing these fields:

    $ [dspace]/bin/dspace curate -t requiredmetadata -i 10568/1 -r - > /tmp/curation.out

    2017-05-06

    @@ -149,15 +148,14 @@

    2017-05-07

  • Testing one replacement for CCAFS Flagships (cg.subject.ccafs), first changed in the submission forms, and then in the database:

    $ ./fix-metadata-values.py -i ccafs-flagships-may7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p fuuu

    2017-05-08

    @@ -168,19 +166,20 @@
  • When ingesting some collections I was getting java.lang.OutOfMemoryError: GC overhead limit exceeded, which can be solved by disabling the GC timeout with -XX:-UseGCOverheadLimit
  • Other times I was getting an error about heap space, so I kept bumping the RAM allocation by 512MB each time (up to 4096m!) it crashed
  • This leads to tens of thousands of abandoned files in the assetstore, which need to be cleaned up using dspace cleanup -v, or else you’ll run out of disk space
  • In the end I realized it’s better to use submission mode (-s) to ingest the community object as a single AIP without its children, followed by each of the collections:

    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m -XX:-UseGCOverheadLimit"
     $ [dspace]/bin/dspace packager -s -o ignoreHandle=false -t AIP -e some@user.com -p 10568/87775 /home/aorth/10947-1/10947-1.zip
     $ for collection in /home/aorth/10947-1/COLLECTION@10947-*; do [dspace]/bin/dspace packager -s -o ignoreHandle=false -t AIP -e some@user.com -p 10947/1 $collection; done
     $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager -r -f -u -t AIP -e some@user.com $item; done

    2017-05-09

  • Clean these up in the database using:

    dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
  • I ended up running into issues during data cleaning and decided to wipe out the entire community and re-sync DSpace Test assetstore and database from CGSpace rather than waiting for the cleanup task to clean up

  • Hours into the re-ingestion I ran into more errors, and had to erase everything and start over again!

  • Now, no matter what I do I keep getting foreign key errors…

    Caused by: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint "handle_pkey"
  Detail: Key (handle_id)=(80928) already exists.

    2017-05-10

@@ -224,8 +226,8 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
  • Finally finished importing all the CGIAR Library content, final method was:

    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx3072m -XX:-UseGCOverheadLimit"
     $ [dspace]/bin/dspace packager -r -a -t AIP -o skipIfParentMissing=true -e some@user.com -p 10568/80923 /home/aorth/10947-2517/10947-2517.zip
    @@ -234,17 +236,19 @@ $ [dspace]/bin/dspace packager -r -a -t AIP -o skipIfParentMissing=true -e some@
     $ [dspace]/bin/dspace packager -s -t AIP -o ignoreHandle=false -e some@user.com -p 10568/80923 /home/aorth/10947-1/10947-1.zip
     $ for collection in /home/aorth/10947-1/COLLECTION@10947-*; do [dspace]/bin/dspace packager -s -o ignoreHandle=false -t AIP -e some@user.com -p 10947/1 $collection; done
     $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager -r -f -u -t AIP -e some@user.com $item; done
  • Basically, import the smaller communities using recursive AIP import (with skipIfParentMissing)

  • Then, for the larger collection, create the community, collections, and items separately, ingesting the items one by one

  • The -XX:-UseGCOverheadLimit JVM option helps with some issues in large imports

  • After this I ran the update-sequences.sql script (with Tomcat shut down), and cleaned up the 200+ blank metadata records:

    dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
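• For reference, the sequences script ships with DSpace and can be run with psql while Tomcat is stopped, something like this (adjust the path for the actual install):

    $ psql dspace < [dspace]/etc/postgres/update-sequences.sql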

    2017-05-13

    @@ -261,13 +265,12 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
  • After that I started looking in the dc.subject field to try to pull countries and regions out, but there are too many values in there
  • Bump the Academicons dependency of the Mirage 2 themes from 1.6.0 to 1.8.0 because the upstream deleted the old tag and now the build is failing: #321
  • Merge changes to CCAFS project identifiers and flagships: #320
  • Run updates for CCAFS flagships on CGSpace:

    $ ./fix-metadata-values.py -i /tmp/ccafs-flagships-may7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p 'fuuu'
  • Also, they have a lot of messed up values in their cg.subject.wle field so I will clean up some of those first:

    dspace=# \copy (select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id=119) to /tmp/wle.csv with csv;
     COPY 111

    2017-05-26

    @@ -410,38 +403,33 @@ COPY 111
  • File an issue on GitHub to explore/track migration to proper country/region codes (ISO 23 and UN M.49): #326
  • Ask Peter how the Landportal.info people should acknowledge us as the source of data on their website
  • Communicate with MARLO people about progress on exposing ORCIDs via the REST API, as it is set to be discussed in the June, 2017 DCAT meeting
  • Find all of Amos Omore’s author name variations so I can link them to his authority entry that has an ORCID:

    dspace=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like 'Omore, A%';
  • Set the authority for all variations to one containing an ORCID:

    dspace=# update metadatavalue set authority='4428ee88-90ef-4107-b837-3c0ec988520b', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Omore, A%';
     UPDATE 187
  • Next I need to do Edgar Twine:

    dspace=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like 'Twine, E%';
  • But it doesn’t look like any of his existing entries are linked to an authority which has an ORCID, so I edited the metadata via “Edit this Item” and looked up his ORCID and linked it there

  • Now I should be able to set his name variations to the new authority:

    dspace=# update metadatavalue set authority='f70d0a01-d562-45b8-bca3-9cf7f249bc8b', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Twine, E%';

    2017-05-29

diff --git a/docs/2017-06/index.html b/docs/2017-06/index.html
index bdc2710a7..6f626a54a 100644
--- a/docs/2017-06/index.html
+++ b/docs/2017-06/index.html
@@ -148,16 +148,17 @@
  • Command like: $ gs -dNOPAUSE -dBATCH -dFirstPage=14 -dLastPage=27 -sDEVICE=pdfwrite -sOutputFile=beans.pdf -f 12605-1.pdf
  • 17 of the items have issues with incorrect page number ranges, and upon closer inspection they do not appear in the referenced PDF
  • I’ve flagged them and proceeded without them (752 total) on DSpace Test:

    $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/93843 --source /home/aorth/src/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books.map &> /tmp/ciat-books.log

    2017-06-07

    @@ -167,8 +168,8 @@
  • Still doesn’t seem to give results I’d expect, like there are no results for Maria Garruccio, or for the ILRI community!
  • Then I’ll file an update to the issue on Atmire’s tracker
  • Created a new branch with just the relevant changes, so I can send it to them
  • One thing I noticed is that there is a failed database migration related to CUA:

    +----------------+----------------------------+---------------------+---------+
     | Version        | Description                | Installed on        | State   |
    @@ -194,10 +195,9 @@
     | 5.5.2015.12.03 | Atmire MQM migration       | 2016-11-27 06:39:06 | OutOrde |
     | 5.6.2016.08.08 | CUA emailreport migration  | 2017-01-29 11:18:56 | OutOrde |
     +----------------+----------------------------+---------------------+---------+

    2017-06-18

    @@ -220,53 +220,56 @@
  • replace(value,/<\/?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)\/?>/,'')
  • value.unescape("html").unescape("xml")
  • Finally import 914 CIAT Book Chapters to CGSpace in two batches:

    $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/35701 --source /home/aorth/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books.map &> /tmp/ciat-books.log
     $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/35701 --source /home/aorth/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books2.map &> /tmp/ciat-books2.log

    2017-06-25

  • As of now it doesn’t look like there are any items using this research theme so we don’t need to do any updates:

    dspace=# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=237 and text_value like 'Regenerating Degraded Landscapes%';
    - text_value
    +text_value
     ------------
     (0 rows)

    2017-06-30

  • CGSpace went down briefly, I see lots of these errors in the dspace logs:

    Java stacktrace: java.util.NoSuchElementException: Timeout waiting for idle object

Test A for displaying the Phase I and II research themes

diff --git a/docs/2017-07/index.html b/docs/2017-07/index.html
index df2732e6a..8eb7b53f1 100644
--- a/docs/2017-07/index.html
+++ b/docs/2017-07/index.html
@@ -39,7 +39,7 @@ Merge changes for WLE Phase II theme rename (#329)
 Looking at extracting the metadata registries from ICARDA’s MEL DSpace database so we can compare fields with CGSpace
 We can use PostgreSQL’s extended output format (-x) plus sed to format the output into quasi XML:
@@ -155,32 +155,30 @@ We can use PostgreSQL’s extended output format (-x) plus sed to format the

  • Generate list of fields in the current CGSpace cg scheme so we can record them properly in the metadata registry:

    $ psql dspace -x -c 'select element, qualifier, scope_note from metadatafieldregistry where metadata_schema_id=2 order by element, qualifier;' | sed -r 's:^-\[ RECORD (.*) \]-+$:</dc-type>\n<dc-type>\n<schema>cg</schema>:;s:([^ ]*) +\| (.*):  <\1>\2</\1>:;s:^$:</dc-type>:;1s:</dc-type>\n::' > cg-types.xml
  • CGSpace was unavailable briefly, and I saw this error in the DSpace log file:

    2017-07-05 13:05:36,452 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
     org.postgresql.util.PSQLException: FATAL: remaining connection slots are reserved for non-replication superuser connections
  • Looking at the pg_stat_activity table I saw there were indeed 98 active connections to PostgreSQL, and at this time the limit is 100, so that makes sense

  • Tsega restarted Tomcat and it’s working now

  • Abenet said she was generating a report with Atmire’s CUA module, so it could be due to that?

  • Looking in the logs I see this random error again that I should report to DSpace:

    2017-07-05 13:50:07,196 ERROR org.dspace.statistics.SolrLogger @ COUNTRY ERROR: EU

    2017-07-06

    @@ -236,14 +234,12 @@ org.postgresql.util.PSQLException: FATAL: remaining connection slots are reserve

    2017-07-24

  • Move two top-level communities to be sub-communities of ILRI Projects

    $ for community in 10568/2347 10568/25209; do /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/27629 --child="$community"; done

    2017-07-27

    @@ -279,27 +275,25 @@ org.postgresql.util.PSQLException: FATAL: remaining connection slots are reserve

    2017-07-31

  • Looks like the final list of metadata corrections for CCAFS project tags will be:

    delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value='PII-FP4_CRMWestAfrica';
     update metadatavalue set text_value='FP3_VietnamLED' where resource_type_id=2 and metadata_field_id=134 and text_value='FP3_VeitnamLED';
     update metadatavalue set text_value='PII-FP1_PIRCCA' where resource_type_id=2 and metadata_field_id=235 and text_value='PII-SEA_PIRCCA';
     delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value='PII-WA_IntegratedInterventions';
  • Now just waiting to run them on CGSpace, and then apply the modified input forms after Macaroni Bros give me an updated list

  • Temporarily increase the nginx upload limit to 200MB for Sisay to upload the CIAT presentations

  • Looking at CGSpace activity page, there are 52 Baidu bots concurrently crawling our website (I copied the activity page to a text file and grep it)!

    $ grep 180.76. /tmp/status | awk '{print $5}' | sort | uniq | wc -l
     52
diff --git a/docs/2017-08/index.html b/docs/2017-08/index.html
index 81c3742c2..fad427d63 100644
--- a/docs/2017-08/index.html
+++ b/docs/2017-08/index.html
@@ -59,7 +59,7 @@ This was due to newline characters in the dc.description.abstract column, which
 I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using g/^$/d
 Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet
@@ -220,14 +220,13 @@ Then I cleaned up the author authorities and HTML characters in OpenRefine and s
  • Had to do some quality checks and column renames before importing, as either Sisay or Abenet renamed a few columns and the metadata importer wanted to remove/add new metadata for title, abstract, etc.
  • Also I applied the HTML entities unescape transform on the abstract column in Open Refine
  • I need to get an author list from the database for only the CGIAR Library community to send to Peter
  • It turns out that I had already used this SQL query in May, 2017 to get the authors from CGIAR Library:

    dspace#= \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/93761', '10947/1', '10947/10', '10947/11', '10947/12', '10947/13', '10947/14', '10947/15', '10947/16', '10947/17', '10947/18', '10947/19', '10947/2', '10947/20', '10947/21', '10947/22', '10947/23', '10947/24', '10947/25', '10947/2512', '10947/2515', '10947/2516', '10947/2517', '10947/2518', '10947/2519', '10947/2520', '10947/2521', '10947/2522', '10947/2523', '10947/2524', '10947/2525', '10947/2526', '10947/2527', '10947/2528', '10947/2529', '10947/2530', '10947/2531', '10947/2532', '10947/2533', '10947/2534', '10947/2535', '10947/2536', '10947/2537', '10947/2538', '10947/2539', '10947/2540', '10947/2541', '10947/2589', '10947/26', '10947/2631', '10947/27', '10947/2708', '10947/2776', '10947/2782', '10947/2784', '10947/2786', '10947/2790', '10947/28', '10947/2805', '10947/2836', '10947/2871', '10947/2878', '10947/29', '10947/2900', '10947/2919', '10947/3', '10947/30', '10947/31', '10947/32', '10947/33', '10947/34', '10947/3457', '10947/35', '10947/36', '10947/37', '10947/38', '10947/39', '10947/4', '10947/40', '10947/4052', '10947/4054', '10947/4056', '10947/4068', '10947/41', '10947/42', '10947/43', '10947/4368', '10947/44', '10947/4467', '10947/45', '10947/4508', '10947/4509', '10947/4510', '10947/4573', '10947/46', '10947/4635', '10947/4636', '10947/4637', '10947/4638', '10947/4639', '10947/4651', '10947/4657', '10947/47', '10947/48', '10947/49', '10947/5', '10947/50', '10947/51', '10947/5308', '10947/5322', '10947/5324', '10947/5326', '10947/6', '10947/7', '10947/8', '10947/9'))) group by text_value order by count desc) to /tmp/cgiar-library-authors.csv with csv;

    2017-08-11

@@ -254,29 +255,29 @@ Then I cleaned up the author authorities and HTML characters in OpenRefine and s
  • Looking at the logs for the REST API on /rest, it looks like someone is hammering it, doing testing or something…

    # awk '{print $1}' /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 5
    140 66.249.66.91
    404 66.249.66.90
   1479 50.116.102.77
   9794 45.5.184.196
  85736 70.32.83.92
    # log oai requests
    location /oai {
        access_log /var/log/nginx/oai.log;
        proxy_pass http://tomcat_http;
    }

    2017-08-13

  • Wow, I’m playing with the AGROVOC SPARQL endpoint using the sparql-query tool:

    $ ./sparql-query http://202.45.139.84:10035/catalogs/fao/repositories/agrovoc
     sparql$ PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
     SELECT 
    -    ?label 
    +?label 
     WHERE {  
    -   {  ?concept  skos:altLabel ?label . } UNION {  ?concept  skos:prefLabel ?label . }
    -   FILTER regex(str(?label), "^fish", "i") .
    +{  ?concept  skos:altLabel ?label . } UNION {  ?concept  skos:prefLabel ?label . }
    +FILTER regex(str(?label), "^fish", "i") .
     } LIMIT 10;
     
     ┌───────────────────────┐                                                      
    @@ -505,12 +507,13 @@ WHERE {
     │ fishing times         │                                                      
     │ fish passes           │                                                      
     └───────────────────────┘

    2017-08-19

@@ -526,35 +529,35 @@ WHERE {
  • Look at the CGIAR Library to see if I can find the items that have been submitted since May:

    dspace=# select * from metadatavalue where metadata_field_id=11 and date(text_value) > '2017-05-01T00:00:00Z';
    - metadata_value_id | item_id | metadata_field_id |      text_value      | text_lang | place | authority | confidence 
    +metadata_value_id | item_id | metadata_field_id |      text_value      | text_lang | place | authority | confidence 
     -------------------+---------+-------------------+----------------------+-----------+-------+-----------+------------
    -            123117 |    5872 |                11 | 2017-06-28T13:05:18Z |           |     1 |           |         -1
    -            123042 |    5869 |                11 | 2017-05-15T03:29:23Z |           |     1 |           |         -1
    -            123056 |    5870 |                11 | 2017-05-22T11:27:15Z |           |     1 |           |         -1
    -            123072 |    5871 |                11 | 2017-06-06T07:46:01Z |           |     1 |           |         -1
    -            123171 |    5874 |                11 | 2017-08-04T07:51:20Z |           |     1 |           |         -1
    +        123117 |    5872 |                11 | 2017-06-28T13:05:18Z |           |     1 |           |         -1
    +        123042 |    5869 |                11 | 2017-05-15T03:29:23Z |           |     1 |           |         -1
    +        123056 |    5870 |                11 | 2017-05-22T11:27:15Z |           |     1 |           |         -1
    +        123072 |    5871 |                11 | 2017-06-06T07:46:01Z |           |     1 |           |         -1
    +        123171 |    5874 |                11 | 2017-08-04T07:51:20Z |           |     1 |           |         -1
     (5 rows)
  • According to dc.date.accessioned (metadata field id 11) there have only been five items submitted since May

  • These are their handles:

    dspace=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id in (select item_id from metadatavalue where metadata_field_id=11 and date(text_value) > '2017-05-01T00:00:00Z');
    -   handle   
    +handle   
     ------------
    - 10947/4658
    - 10947/4659
    - 10947/4660
    - 10947/4661
    - 10947/4664
    +10947/4658
    +10947/4659
    +10947/4660
    +10947/4661
    +10947/4664
     (5 rows)

    2017-08-23

    @@ -575,17 +578,18 @@ WHERE {
  • I notice that in many WLE collections Marianne Gadeberg is in the edit or approval steps, but she is also in the groups for those steps.
  • I think we need to have a process to go back and check / fix some of these scenarios—to remove her user from the step and instead add her to the group—because we have way too many authorizations and in late 2016 we had performance issues with Solr because of this
  • I asked Sisay about this and hinted that he should go back and fix these things, but let’s see what he says
  • Saw CGSpace go down briefly today and noticed SQL connection pool errors in the dspace log file:

    ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error
     org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
diff --git a/docs/2017-09/index.html b/docs/2017-09/index.html
index 26de9b713..9adf4b6ca 100644
--- a/docs/2017-09/index.html
+++ b/docs/2017-09/index.html
@@ -35,7 +35,7 @@ Linode sent an alert that CGSpace (linode18) was using 261% CPU for the past two
 Ask Sisay to clean up the WLE approvers a bit, as Marianne’s user account is both in the approvers step as well as the group
@@ -129,26 +129,33 @@ Ask Sisay to clean up the WLE approvers a bit, as Marianne’s user account

    2017-09-10

  • Delete 58 blank metadata values from the CGSpace database:

    dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
     DELETE 58
  • I also ran it on DSpace Test because we’ll be migrating the CGIAR Library soon and it would be good to catch these before we migrate

  • Run system updates and restart DSpace Test

  • We only have 7.7GB of free space on DSpace Test so I need to copy some data off of it before doing the CGIAR Library migration (requires lots of exporting and creating temp files)

  • I still have the original data from the CGIAR Library so I’ve zipped it up and sent it off to linode18 for now

  • sha256sum of original-cgiar-library-6.6GB.tar.gz is: bcfabb52f51cbdf164b61b7e9b3a0e498479e4c1ed1d547d32d11f44c0d5eb8a

  • Start doing a test run of the CGIAR Library migration locally

  • Notes and todo checklist here for now: https://gist.github.com/alanorth/3579b74e116ab13418d187ed379abd9c

  • Create pull request for Phase I and II changes to CCAFS Project Tags: #336

  • We’ve been discussing with Macaroni Bros and CCAFS for the past month or so and the list of tags was recently finalized

  • There will need to be some metadata updates — though if I recall correctly it is only about seven records — for that as well, I had made some notes about it in 2017-07, but I’ve asked for more clarification from Lili just in case

  • Looking at the DSpace logs to see if we’ve had a change in the “Cannot get a connection” errors since last month when we adjusted the db.maxconnections parameter on CGSpace:

    # grep -c "Cannot get a connection, pool error Timeout waiting for idle object" dspace.log.2017-09-*
     dspace.log.2017-09-01:0
    @@ -161,14 +168,17 @@ dspace.log.2017-09-07:0
     dspace.log.2017-09-08:10
     dspace.log.2017-09-09:0
     dspace.log.2017-09-10:0

    2017-09-11

@@ -183,27 +193,30 @@ dspace.log.2017-09-10:0
  • Also, I captured TCP packets destined for port 80 and both imports only captured ONE packet (an update check from some component in Java):

    $ sudo tcpdump -i en0 -w without-cached-xsd.dump dst port 80 and 'tcp[32:4] = 0x47455420'
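• As a sanity check on that filter expression, 0x47455420 is just the ASCII bytes for “GET ” (a quick one-liner to confirm):

    $ python3 -c "print(bytes.fromhex('47455420'))"
    b'GET '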
  • Great TCP dump guide here: https://danielmiessler.com/study/tcpdump

  • The last part of that command filters for HTTP GET requests, of which there should have been many to fetch all the XSD files for validation

  • I sent a message to the mailing list to see if anyone knows more about this

  • In looking at the tcpdump results I notice that there is an update check to the ehcache server on every iteration of the ingest loop, for example:

    09:39:36.008956 IP 192.168.8.124.50515 > 157.189.192.67.http: Flags [P.], seq 1736833672:1736834103, ack 147469926, win 4120, options [nop,nop,TS val 1175113331 ecr 550028064], length 431: HTTP: GET /kit/reflector?kitID=ehcache.default&pageID=update.properties&id=2130706433&os-name=Mac+OS+X&jvm-name=Java+HotSpot%28TM%29+64-Bit+Server+VM&jvm-version=1.8.0_144&platform=x86_64&tc-version=UNKNOWN&tc-product=Ehcache+Core+1.7.2&source=Ehcache+Core&uptime-secs=0&patch=UNKNOWN HTTP/1.1
  • With this GREL in OpenRefine I can find items that are mapped, ie they have 10568/3|| or 10568/3$ in their collection field:

    isNotNull(value.match(/.+?10568\/3(\|\|.+|$)/))

    2017-09-25

@@ -650,30 +658,27 @@ DELETE 207
  • Peter wants me to clean up the text values for Delia Grace’s metadata, as the authorities are all messed up again since we cleaned them up in 2016-12:

    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';                                  
    -  text_value  |              authority               | confidence              
    +text_value  |              authority               | confidence              
     --------------+--------------------------------------+------------             
 Grace, Delia |                                      |        600
 Grace, Delia | bfa61d7c-7583-4175-991c-2e7315000f0c |        600
 Grace, Delia | bfa61d7c-7583-4175-991c-2e7315000f0c |         -1
 Grace, D.    | 6a8ddca3-33c1-45f9-aa00-6fa9fc91e3fc |         -1
  • - +
  • Strangely, none of her authority entries have ORCIDs anymore…

  • + +
  • I’ll just fix the text values and forget about it for now:

    dspace=# update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-4175-991c-2e7315000f0c', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
     UPDATE 610
    -
    +
  • - +
  • After this we have to reindex the Discovery and Authority cores (as tomcat7 user):

    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
     $ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
    @@ -686,41 +691,39 @@ Retrieving all data
     Initialize org.dspace.authority.indexer.DSpaceAuthorityIndexer
     Exception: null
     java.lang.NullPointerException
    -        at org.dspace.authority.AuthorityValueGenerator.generateRaw(AuthorityValueGenerator.java:82)
    -        at org.dspace.authority.AuthorityValueGenerator.generate(AuthorityValueGenerator.java:39)
    -        at org.dspace.authority.indexer.DSpaceAuthorityIndexer.prepareNextValue(DSpaceAuthorityIndexer.java:201)
    -        at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:132)
    -        at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
    -        at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
    -        at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:159)
    -        at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
    -        at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
    -        at org.dspace.authority.indexer.AuthorityIndexClient.main(AuthorityIndexClient.java:61)
    -        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    -        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    -        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    -        at java.lang.reflect.Method.invoke(Method.java:498)
    -        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
    -        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
    +    at org.dspace.authority.AuthorityValueGenerator.generateRaw(AuthorityValueGenerator.java:82)
    +    at org.dspace.authority.AuthorityValueGenerator.generate(AuthorityValueGenerator.java:39)
    +    at org.dspace.authority.indexer.DSpaceAuthorityIndexer.prepareNextValue(DSpaceAuthorityIndexer.java:201)
    +    at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:132)
    +    at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
    +    at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
    +    at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:159)
    +    at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
    +    at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
    +    at org.dspace.authority.indexer.AuthorityIndexClient.main(AuthorityIndexClient.java:61)
    +    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    +    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    +    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    +    at java.lang.reflect.Method.invoke(Method.java:498)
    +    at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
    +    at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
     
     real    6m6.447s
     user    1m34.010s
     sys     0m12.113s
    -
    +
  • - +
  • The index-authority script always seems to fail, I think it’s the same old bug

  • + +
  • Something interesting for my notes about JNDI database pool—since I couldn’t determine if it was working or not when I tried it locally the other day—is this error message that I just saw in the DSpace logs today:

    ERROR org.dspace.storage.rdbms.DatabaseManager @ Error retrieving JNDI context: jdbc/dspaceLocal
     ...
     INFO  org.dspace.storage.rdbms.DatabaseManager @ Unable to locate JNDI dataSource: jdbc/dspaceLocal
     INFO  org.dspace.storage.rdbms.DatabaseManager @ Falling back to creating own Database pool
    -
    +
  • -

    2017-09-26

     @@ -741,24 +744,23 @@ INFO org.dspace.storage.rdbms.DatabaseManager @ Falling back to creating own Da
  • I quickly registered a Let’s Encrypt certificate for the domain:

    # systemctl stop nginx
     # /opt/certbot-auto certonly --standalone --email aorth@mjanja.ch -d library.cgiar.org
     # systemctl start nginx
    -
    +
  • - +
  • I modified the nginx configuration of the ansible playbooks to use this new certificate and now the certificate is enabled and OCSP stapling is working:

    $ openssl s_client -connect cgspace.cgiar.org:443 -servername library.cgiar.org  -tls1_2 -tlsextdebug -status
     ...
     OCSP Response Data:
     ...
     Cert Status: good
    -
    +
diff --git a/docs/2017-10/index.html b/docs/2017-10/index.html
index 625d64b64..d053a969e 100644
--- a/docs/2017-10/index.html
+++ b/docs/2017-10/index.html
@@ -121,40 +119,38 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG

    2017-10-01

    +
  • Peter emailed to point out that many items in the ILRI archive collection have multiple handles:

    http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
    -
    +
  • -

    2017-10-02

    + +
  • I looked in the logs and saw some LDAP lookup failures due to timeout but also strangely a “no DN found” error:

    2017-10-01 20:24:57,928 WARN  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=CA0AA5FEAEA8805645489404CDCE9594:ip_addr=41.204.190.40:ldap_attribute_lookup:type=failed_search javax.naming.CommunicationException\colon; svcgroot2.cgiarad.org\colon;3269 [Root exception is java.net.ConnectException\colon; Connection timed out (Connection timed out)]
     2017-10-01 20:22:37,982 INFO  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=CA0AA5FEAEA8805645489404CDCE9594:ip_addr=41.204.190.40:failed_login:no DN found for user pballantyne
    -
    +
  • - +
  • I thought maybe his account had expired (seeing as it was the first of the month) but he says he was finally able to log in today

  • + +
  • The logs for yesterday show fourteen errors related to LDAP auth failures:

    $ grep -c "ldap_authentication:type=failed_auth" dspace.log.2017-10-01
     14
    -
    +
  • -

    2017-10-04

     @@ -162,59 +158,67 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
  • The first is a link to a browse page that should be handled better in nginx:

    http://library.cgiar.org/browse?value=Intellectual%20Assets%20Reports&type=subject → https://cgspace.cgiar.org/browse?value=Intellectual%20Assets%20Reports&type=subject
    -
    +
  • -
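  • Once an nginx rewrite is in place for that, a quick sanity check is to look at the Location header it returns (a sketch using the URL above):

     $ curl -sI 'http://library.cgiar.org/browse?value=Intellectual%20Assets%20Reports&type=subject' | grep -i '^Location'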

    2017-10-05

    + +
  • I had a look at yesterday’s OAI and REST logs in /var/log/nginx but didn’t see anything unusual:

    # awk '{print $1}' /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 10
    -    141 157.55.39.240
    -    145 40.77.167.85
    -    162 66.249.66.92
    -    181 66.249.66.95
    -    211 66.249.66.91
    -    312 66.249.66.94
    -    384 66.249.66.90
    -   1495 50.116.102.77
    -   3904 70.32.83.92
    -   9904 45.5.184.196
    +141 157.55.39.240
    +145 40.77.167.85
    +162 66.249.66.92
    +181 66.249.66.95
    +211 66.249.66.91
    +312 66.249.66.94
    +384 66.249.66.90
    +1495 50.116.102.77
    +3904 70.32.83.92
    +9904 45.5.184.196
     # awk '{print $1}' /var/log/nginx/oai.log.1 | sort -n | uniq -c | sort -h | tail -n 10
    -      5 66.249.66.71
    -      6 66.249.66.67
    -      6 68.180.229.31
    -      8 41.84.227.85
    -      8 66.249.66.92
    -     17 66.249.66.65
    -     24 66.249.66.91
    -     38 66.249.66.95
    -     69 66.249.66.90
    -    148 66.249.66.94
    -
     + 5 66.249.66.71
     + 6 66.249.66.67
     + 6 68.180.229.31
     + 8 41.84.227.85
     + 8 66.249.66.92
     + 17 66.249.66.65
     + 24 66.249.66.91
     + 38 66.249.66.95
     + 69 66.249.66.90
     +148 66.249.66.94
     +
  • -

    2017-10-06

    @@ -251,19 +255,19 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
  • I tried to submit a “Change of Address” request in the Google Search Console but I need to be an owner on CGSpace’s console (currently I’m just a user) in order to do that
  • Manually clean up some communities and collections that Peter had requested a few weeks ago
  • Delete Community 10568/102 (ILRI Research and Development Issues)
  • Move five collections to 10568/27629 (ILRI Projects) using move-collections.sh with the following configuration:

    10568/1637 10568/174 10568/27629
     10568/1642 10568/174 10568/27629
     10568/1614 10568/174 10568/27629
     10568/75561 10568/150 10568/27629
     10568/183 10568/230 10568/27629
    -
    +
  • -

    2017-10-11

    @@ -311,31 +315,34 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
  • In other news, I was trying to look at a question about stats raised by Magdalena and then CGSpace went down due to SQL connection pool
  • Looking at the PostgreSQL activity I see there are 93 connections, but after a minute or two they went down and CGSpace came back up
  • Annnd I reloaded the Atmire Usage Stats module and the connections shot back up and CGSpace went down again
  • Still not sure where the load is coming from right now, but it’s clear why there were so many alerts yesterday on the 25th!

    # grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-25 | sort -n | uniq | wc -l
     18022
    -
    +
  • - +
  • Compared to other days there were two or three times the number of requests yesterday!

    # grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-23 | sort -n | uniq | wc -l
     3141
     # grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-26 | sort -n | uniq | wc -l
     7851
    -
    +
  • -

    2017-10-27

     @@ -355,133 +362,126 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
  • I don’t see any telltale signs in the REST or OAI logs, so trying to do rudimentary analysis in DSpace logs:

    # grep '2017-10-29 02:' dspace.log.2017-10-29 | grep -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     2049
    -
    +
  • - +
  • So there were 2049 unique sessions during the hour of 2AM

  • + +
  • Looking at my notes, the number of unique sessions was about the same during the same hour on other days when there were no alerts

  • + +
  • I think I’ll need to enable access logging in nginx to figure out what’s going on

  • + +
  • After enabling logging on requests to XMLUI on / I see some new bot I’ve never seen before:

    137.108.70.6 - - [29/Oct/2017:07:39:49 +0000] "GET /discover?filtertype_0=type&filter_relational_operator_0=equals&filter_0=Internal+Document&filtertype=author&filter_relational_operator=equals&filter=CGIAR+Secretariat HTTP/1.1" 200 7776 "-" "Mozilla/5.0 (compatible; CORE/0.6; +http://core.ac.uk; http://core.ac.uk/intro/contact)"
    -
    +
  • -

    2017-10-30

    + +
  • Uptime Robot noticed that CGSpace went down around 10:15 AM, and I saw that there were 93 PostgreSQL connections:

    dspace=# SELECT * FROM pg_stat_activity;
     ...
     (93 rows)
    -
    +
  • - +
  • Surprise surprise, the CORE bot is likely responsible for the recent load issues, making hundreds of thousands of requests yesterday and today:

    # grep -c "CORE/0.6" /var/log/nginx/access.log 
     26475
     # grep -c "CORE/0.6" /var/log/nginx/access.log.1
     135083
    -
    +
  • - +
  • IP addresses for this bot currently seem to be:

    # grep "CORE/0.6" /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq
     137.108.70.6
     137.108.70.7
    -
    +
  • - +
  • I will add their user agent to the Tomcat Session Crawler Valve but it won’t help much because they are only using two sessions:

    # grep 137.108.70 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq
     session_id=5771742CABA3D0780860B8DA81E0551B
     session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
    -
    +
  • - +
  • … and most of their requests are for dynamic discover pages:

    # grep -c 137.108.70 /var/log/nginx/access.log
     26622
     # grep 137.108.70 /var/log/nginx/access.log | grep -c "GET /discover"
     24055
    -
    +
  • - +
  • Just because I’m curious who the top IPs are:

    # awk '{print $1}' /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
    -    496 62.210.247.93
    -    571 46.4.94.226
    -    651 40.77.167.39
    -    763 157.55.39.231
    -    782 207.46.13.90
    -    998 66.249.66.90
    -   1948 104.196.152.243
    -   4247 190.19.92.5
    -  31602 137.108.70.6
    -  31636 137.108.70.7
    -
     +496 62.210.247.93
     +571 46.4.94.226
     +651 40.77.167.39
     +763 157.55.39.231
     +782 207.46.13.90
     +998 66.249.66.90
     +1948 104.196.152.243
     +4247 190.19.92.5
     +31602 137.108.70.6
     +31636 137.108.70.7
     +
  • - +
  • At least we know the top two are CORE, but who are the others?

  • + +
  • 190.19.92.5 is apparently in Argentina, and 104.196.152.243 is from Google Cloud Engine
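  • That sort of thing is quick to confirm from the command line with whois, for example:

     $ whois 190.19.92.5 | grep -i country
     $ whois 104.196.152.243 | grep -i -E 'orgname|netname'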

  • + +
  • Actually, these two scrapers might be more responsible for the heavy load than the CORE bot, because they don’t reuse their session variable, creating thousands of new sessions!

    # grep 190.19.92.5 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     1419
     # grep 104.196.152.243 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     2811
    -
    +
  • - +
  • From looking at the requests, it appears these are from CIAT and CCAFS

  • + +
  • I wonder if I could somehow instruct them to use a user agent so that we could apply a crawler session manager valve to them

  • + +
  • Actually, according to the Tomcat docs, we could use an IP with crawlerIps: https://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Crawler_Session_Manager_Valve

  • + +
  • Ah, wait, it looks like crawlerIps only came in 2017-06, so probably isn’t in Ubuntu 16.04’s 7.0.68 build!

  • + +
  • That would explain the errors I was getting when trying to set it:

    WARNING: [SetPropertiesRule]{Server/Service/Engine/Host/Valve} Setting property 'crawlerIps' to '190\.19\.92\.5|104\.196\.152\.243' did not find a matching property.
    -
    +
  • - +
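  • A quick way to confirm exactly which Tomcat build is installed (this assumes it came from Ubuntu’s tomcat7 package):

     $ dpkg -s tomcat7 | grep -E '^Version'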
  • As for now, it actually seems the CORE bot coming from 137.108.70.6 and 137.108.70.7 is only using a few sessions per day, which is good:

    # grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=137.108.70.(6|7)' dspace.log.2017-10-30 | sort -n | uniq -c | sort -h
    -    410 session_id=74F0C3A133DBF1132E7EC30A7E7E0D60:ip_addr=137.108.70.7
    -    574 session_id=5771742CABA3D0780860B8DA81E0551B:ip_addr=137.108.70.7
    -   1012 session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A:ip_addr=137.108.70.6
    -
     +410 session_id=74F0C3A133DBF1132E7EC30A7E7E0D60:ip_addr=137.108.70.7
     +574 session_id=5771742CABA3D0780860B8DA81E0551B:ip_addr=137.108.70.7
     +1012 session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A:ip_addr=137.108.70.6
     +
  • -

    2017-10-31

     @@ -489,40 +489,43 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
  • To follow up on the CORE bot traffic, there were almost 300,000 requests yesterday:

    # grep "CORE/0.6" /var/log/nginx/access.log.1 | awk '{print $1}' | sort -n | uniq -c | sort -h
    - 139109 137.108.70.6
    - 139253 137.108.70.7
    -
     +139109 137.108.70.6
     +139253 137.108.70.7
     +
  • - +
  • I’ve emailed the CORE people to ask if they can update the repository information from CGIAR Library to CGSpace

  • + +
  • Also, I asked if they could perhaps use the sitemap.xml, OAI-PMH, or REST APIs to index us more efficiently, because they mostly seem to be crawling the nearly endless Discovery facets

  • + +
  • I added GoAccess to the list of packages to install in the DSpace role of the Ansible infrastructure scripts

  • + +
  • It makes it very easy to analyze nginx logs from the command line, to see where traffic is coming from:

    # goaccess /var/log/nginx/access.log --log-format=COMBINED
    -
    +
  • - +
  • According to Uptime Robot CGSpace went down and up a few times

  • + +
  • I had a look at goaccess and I saw that CORE was actively indexing

  • + +
  • Also, PostgreSQL connections were at 91 (with the max being 60 per web app, hmmm)

  • + +
  • I’m really starting to get annoyed with these guys, and thinking about blocking their IP address for a few days to see if CGSpace becomes more stable

  • + +
  • Actually, come to think of it, they aren’t even obeying robots.txt, because we actually disallow /discover and /search-filter URLs but they are hitting those massively:

    # grep "CORE/0.6" /var/log/nginx/access.log | grep -o -E "GET /(discover|search-filter)" | sort -n | uniq -c | sort -rn 
    - 158058 GET /discover
    -  14260 GET /search-filter
    -
     +158058 GET /discover
     +14260 GET /search-filter
     +
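  • For the record, it is easy to verify what the live robots.txt disallows (a quick sketch with curl):

     $ curl -s https://cgspace.cgiar.org/robots.txt | grep -E 'Disallow: /(discover|search-filter)'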
diff --git a/docs/2017-11/index.html b/docs/2017-11/index.html
index e272e53fc..b54803fa2 100644
--- a/docs/2017-11/index.html
+++ b/docs/2017-11/index.html
@@ -147,38 +143,40 @@

    2017-11-02

    +
  • Today there have been no hits by CORE and no alerts from Linode (coincidence?)

    # grep -c "CORE" /var/log/nginx/access.log
     0
    -
    +
  • - +
  • Generate list of authors on CGSpace for Peter to go through and correct:

    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
     COPY 54701
    -
    +
  • + + +
  • Basically, once you read the METS XML for an item it becomes easy to trace the structure to find the bitstream link

    //mets:fileSec/mets:fileGrp[@USE='CONTENT']/mets:file/mets:FLocat[@LOCTYPE='URL']/@xlink:href
    -
    +
  • -
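  • As a rough sketch of what I mean, something like this should pull the bitstream URLs out of an item’s METS (the handle is made up, and local-name() is just to sidestep the namespace prefixes):

     $ curl -s 'https://cgspace.cgiar.org/metadata/handle/10568/12345/mets.xml' | xmllint --xpath "//*[local-name()='FLocat'][@LOCTYPE='URL']/@*[local-name()='href']" - # 10568/12345 is a made-up handle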

    2017-11-03

    @@ -195,25 +193,23 @@ COPY 54701
  • I corrected about half of the authors to standardize them
  • Linode emailed this morning to say that the CPU usage was high again, this time at 6:14AM
  • It’s the first time in a few days that this has happened
  • I had a look to see what was going on, but it isn’t the CORE bot:

    # awk '{print $1}' /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
    -    306 68.180.229.31
    -    323 61.148.244.116
    -    414 66.249.66.91
    -    507 40.77.167.16
    -    618 157.55.39.161
    -    652 207.46.13.103
    -    666 157.55.39.254
    -   1173 104.196.152.243
    -   1737 66.249.66.90
    -  23101 138.201.52.218
    -
     +306 68.180.229.31
     +323 61.148.244.116
     +414 66.249.66.91
     +507 40.77.167.16
     +618 157.55.39.161
     +652 207.46.13.103
     +666 157.55.39.254
     +1173 104.196.152.243
     +1737 66.249.66.90
     +23101 138.201.52.218
     +
  • - +
  • 138.201.52.218 is from some Hetzner server, and I see it making 40,000 requests yesterday too, but none before that:

    # zgrep -c 138.201.52.218 /var/log/nginx/access.log*
     /var/log/nginx/access.log:24403
    @@ -223,17 +219,14 @@ COPY 54701
     /var/log/nginx/access.log.4.gz:0
     /var/log/nginx/access.log.5.gz:0
     /var/log/nginx/access.log.6.gz:0
    -
    +
  • - +
  • It’s clearly a bot as it’s making tens of thousands of requests, but it’s using a “normal” user agent:

    Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36
    -
    +
  • -

    2017-11-05

    @@ -247,28 +240,29 @@ COPY 54701 Add author

    +
  • But in the database the authors are correct (none with weird , / characters):

    dspace=# select distinct text_value, authority, confidence from metadatavalue value where resource_type_id=2 and metadata_field_id=3 and text_value like 'International Livestock Research Institute%';
    -                 text_value                 |              authority               | confidence 
    +             text_value                 |              authority               | confidence 
     --------------------------------------------+--------------------------------------+------------
    - International Livestock Research Institute | 8f3865dc-d056-4aec-90b7-77f49ab4735c |          0
    - International Livestock Research Institute | f4db1627-47cd-4699-b394-bab7eba6dadc |          0
    - International Livestock Research Institute |                                      |         -1
    - International Livestock Research Institute | 8f3865dc-d056-4aec-90b7-77f49ab4735c |        600
    - International Livestock Research Institute | f4db1627-47cd-4699-b394-bab7eba6dadc |         -1
    - International Livestock Research Institute |                                      |        600
    - International Livestock Research Institute | 8f3865dc-d056-4aec-90b7-77f49ab4735c |         -1
    - International Livestock Research Institute | 8f3865dc-d056-4aec-90b7-77f49ab4735c |        500
    +International Livestock Research Institute | 8f3865dc-d056-4aec-90b7-77f49ab4735c |          0
    +International Livestock Research Institute | f4db1627-47cd-4699-b394-bab7eba6dadc |          0
    +International Livestock Research Institute |                                      |         -1
    +International Livestock Research Institute | 8f3865dc-d056-4aec-90b7-77f49ab4735c |        600
    +International Livestock Research Institute | f4db1627-47cd-4699-b394-bab7eba6dadc |         -1
    +International Livestock Research Institute |                                      |        600
    +International Livestock Research Institute | 8f3865dc-d056-4aec-90b7-77f49ab4735c |         -1
    +International Livestock Research Institute | 8f3865dc-d056-4aec-90b7-77f49ab4735c |        500
     (8 rows)
    -
    +
  • -

    2017-11-07

     @@ -276,25 +270,23 @@ COPY 54701
  • I will start by looking at bot usage (access.log.1 includes usage until 6AM today):

    # cat /var/log/nginx/access.log.1 | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    -    619 65.49.68.184
    -    840 65.49.68.199
    -    924 66.249.66.91
    -   1131 68.180.229.254
    -   1583 66.249.66.90
    -   1953 207.46.13.103
    -   1999 207.46.13.80
    -   2021 157.55.39.161
    -   2034 207.46.13.36
    -   4681 104.196.152.243
    -
     +619 65.49.68.184
     +840 65.49.68.199
     +924 66.249.66.91
     +1131 68.180.229.254
     +1583 66.249.66.90
     +1953 207.46.13.103
     +1999 207.46.13.80
     +2021 157.55.39.161
     +2034 207.46.13.36
     +4681 104.196.152.243
     +
  • - +
  • 104.196.152.243 seems to be a top scraper for a few weeks now:

    # zgrep -c 104.196.152.243 /var/log/nginx/access.log*
     /var/log/nginx/access.log:336
    @@ -307,11 +299,9 @@ COPY 54701
     /var/log/nginx/access.log.7.gz:7517
     /var/log/nginx/access.log.8.gz:7211
     /var/log/nginx/access.log.9.gz:2763
    -
    +
  • - +
  • This user is responsible for hundreds and sometimes thousands of Tomcat sessions:

    $ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     954
    @@ -319,41 +309,35 @@ $ grep 104.196.152.243 dspace.log.2017-11-03 | grep -o -E 'session_id=[A-Z0-9]{3
     6199
     $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     7051
    -
    +
  • - +
  • The worst thing is that this user never specifies a user agent string so we can’t lump it in with the other bots using the Tomcat Session Crawler Manager Valve
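  • The missing user agent is easy to see in nginx’s access log, something like:

     # grep 104.196.152.243 /var/log/nginx/access.log.1 | awk -F'"' '{print $6}' | sort | uniq -c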

  • + +
  • They don’t request dynamic URLs like “/discover” but they seem to be fetching handles from XMLUI instead of REST (and some with //handle, note the regex below):

    # grep -c 104.196.152.243 /var/log/nginx/access.log.1
     4681
     # grep 104.196.152.243 /var/log/nginx/access.log.1 | grep -c -P 'GET //?handle'
     4618
    -
    +
  • - +
  • I just realized that ciat.cgiar.org points to 104.196.152.243, so I should contact Leroy from CIAT to see if we can change their scraping behavior

  • + +
  • The next IP (207.46.13.36) seems to be Microsoft’s bingbot, but all its requests specify the “bingbot” user agent and there are no requests for dynamic URLs that are forbidden, like “/discover”:

    $ grep -c 207.46.13.36 /var/log/nginx/access.log.1 
     2034
     # grep 207.46.13.36 /var/log/nginx/access.log.1 | grep -c "GET /discover"
     0
    -
    +
  • - +
  • The next IP (157.55.39.161) also seems to be bingbot, and none of its requests are for URLs forbidden by robots.txt either:

    # grep 157.55.39.161 /var/log/nginx/access.log.1 | grep -c "GET /discover"
     0
    -
    +
  • - +
  • The next few seem to be bingbot as well, and they declare a proper user agent and do not request dynamic URLs like “/discover”:

    # grep -c -E '207.46.13.[0-9]{2,3}' /var/log/nginx/access.log.1 
     5997
    @@ -361,11 +345,9 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
     5988
     # grep -E '207.46.13.[0-9]{2,3}' /var/log/nginx/access.log.1 | grep -c "GET /discover"
     0
    -
    +
  • - +
  • The next few seem to be Googlebot, and they declare a proper user agent and do not request dynamic URLs like “/discover”:

    # grep -c -E '66.249.66.[0-9]{2,3}' /var/log/nginx/access.log.1 
     3048
    @@ -373,116 +355,109 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
     3048
     # grep -E '66.249.66.[0-9]{2,3}' /var/log/nginx/access.log.1 | grep -c "GET /discover"
     0
    -
    +
  • - +
  • The next seems to be Yahoo, which declares a proper user agent and does not request dynamic URLs like “/discover”:

    # grep -c 68.180.229.254 /var/log/nginx/access.log.1 
     1131
     # grep  68.180.229.254 /var/log/nginx/access.log.1 | grep -c "GET /discover"
     0
    -
    +
  • - +
  • The last of the top ten IPs seems to be some bot with a weird user agent, but they are not behaving too well:

    # grep -c -E '65.49.68.[0-9]{3}' /var/log/nginx/access.log.1 
     2950
     # grep -E '65.49.68.[0-9]{3}' /var/log/nginx/access.log.1 | grep -c "GET /discover"
     330
    -
    +
  • - + +
  • I’ll just keep an eye on that one for now, as it only made a few hundred requests to dynamic discovery URLs

  • + +
  • While it’s not in the top ten, Baidu is one bot that seems to not give a fuck:

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "7/Nov/2017" | grep -c Baiduspider
     8912
     # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "7/Nov/2017" | grep Baiduspider | grep -c -E "GET /(browse|discover|search-filter)"
     2521
    -
    +
  • - +
  • According to their documentation their bot respects robots.txt, but I don’t see this being the case

  • + +
  • I think I will end up blocking Baidu as well…

  • + +
  • Next is for me to look and see what was happening specifically at 3AM and 7AM when the server crashed

  • + +
  • I should look in nginx access.log, rest.log, oai.log, and DSpace’s dspace.log.2017-11-07

  • + +
  • Here are the top IPs making requests to XMLUI from 2 to 8 AM:

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '07/Nov/2017:0[2-8]' | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    -    279 66.249.66.91
    -    373 65.49.68.199
    -    446 68.180.229.254
    -    470 104.196.152.243
    -    470 197.210.168.174
    -    598 207.46.13.103
    -    603 157.55.39.161
    -    637 207.46.13.80
    -    703 207.46.13.36
    -    724 66.249.66.90
    -
     +279 66.249.66.91
     +373 65.49.68.199
     +446 68.180.229.254
     +470 104.196.152.243
     +470 197.210.168.174
     +598 207.46.13.103
     +603 157.55.39.161
     +637 207.46.13.80
     +703 207.46.13.36
     +724 66.249.66.90
     +
  • - +
  • Of those, most are Google, Bing, Yahoo, etc, except 63.143.42.244 and 63.143.42.242 which are Uptime Robot

  • + +
  • Here are the top IPs making requests to REST from 2 to 8 AM:

    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E '07/Nov/2017:0[2-8]' | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    -      8 207.241.229.237
    -     10 66.249.66.90
    -     16 104.196.152.243
    -     25 41.60.238.61
    -     26 157.55.39.161
    -     27 207.46.13.103
    -     27 207.46.13.80
    -     31 207.46.13.36
    -   1498 50.116.102.77
    -
     + 8 207.241.229.237
     + 10 66.249.66.90
     + 16 104.196.152.243
     + 25 41.60.238.61
     + 26 157.55.39.161
     + 27 207.46.13.103
     + 27 207.46.13.80
     + 31 207.46.13.36
     +1498 50.116.102.77
     +
  • - +
  • The OAI requests during that same time period are nothing to worry about:

    # cat /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E '07/Nov/2017:0[2-8]' | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    -      1 66.249.66.92
    -      4 66.249.66.90
    -      6 68.180.229.254
    -
     + 1 66.249.66.92
     + 4 66.249.66.90
     + 6 68.180.229.254
     +
  • - +
  • The top IPs from dspace.log during the 2–8 AM period:

    $ grep -E '2017-11-07 0[2-8]' dspace.log.2017-11-07 | grep -o -E 'ip_addr=[0-9.]+' | sort -n | uniq -c | sort -h | tail
    -    143 ip_addr=213.55.99.121
    -    181 ip_addr=66.249.66.91
    -    223 ip_addr=157.55.39.161
    -    248 ip_addr=207.46.13.80
    -    251 ip_addr=207.46.13.103
    -    291 ip_addr=207.46.13.36
    -    297 ip_addr=197.210.168.174
    -    312 ip_addr=65.49.68.199
    -    462 ip_addr=104.196.152.243
    -    488 ip_addr=66.249.66.90
    -
     +143 ip_addr=213.55.99.121
     +181 ip_addr=66.249.66.91
     +223 ip_addr=157.55.39.161
     +248 ip_addr=207.46.13.80
     +251 ip_addr=207.46.13.103
     +291 ip_addr=207.46.13.36
     +297 ip_addr=197.210.168.174
     +312 ip_addr=65.49.68.199
     +462 ip_addr=104.196.152.243
     +488 ip_addr=66.249.66.90
     +
  • - +
  • These aren’t actually very interesting, as the top few are Google, CIAT, Bingbot, and a few other unknown scrapers

  • + +
  • The number of requests isn’t even that high to be honest

  • + +
  • As I was looking at these logs I noticed another heavy user (124.17.34.59) that was not active during this time period, but made many requests today alone:

    # zgrep -c 124.17.34.59 /var/log/nginx/access.log*
     /var/log/nginx/access.log:22581
    @@ -495,189 +470,188 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
     /var/log/nginx/access.log.7.gz:0
     /var/log/nginx/access.log.8.gz:0
     /var/log/nginx/access.log.9.gz:1
    -
    +
  • - +
  • The whois data shows the IP is from China, but the user agent doesn’t really give any clues:

    # grep 124.17.34.59 /var/log/nginx/access.log | awk -F'" ' '{print $3}' | sort | uniq -c | sort -h
    -    210 "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"
    -  22610 "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.2; Win64; x64; Trident/7.0; LCTE)"
    -
    +210 "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36" +22610 "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.2; Win64; x64; Trident/7.0; LCTE)" +
  • - +
  • A Google search for “LCTE bot” doesn’t return anything interesting, but this Stack Overflow discussion references the lack of information

  • + +
  • So basically after a few hours of looking at the log files I am not closer to understanding what is going on!

  • + +
  • I do know that we want to block Baidu, though, as it does not respect robots.txt

  • + +
  • And as we speak Linode alerted that the outbound traffic rate is very high for the past two hours (about 12–14 hours)

  • + +
  • At least for now it seems to be that new Chinese IP (124.17.34.59):

    # grep -E "07/Nov/2017:1[234]:" /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    -    198 207.46.13.103
    -    203 207.46.13.80
    -    205 207.46.13.36
    -    218 157.55.39.161
    -    249 45.5.184.221
    -    258 45.5.187.130
    -    386 66.249.66.90
    -    410 197.210.168.174
    -   1896 104.196.152.243
    -  11005 124.17.34.59
    -
     +198 207.46.13.103
     +203 207.46.13.80
     +205 207.46.13.36
     +218 157.55.39.161
     +249 45.5.184.221
     +258 45.5.187.130
     +386 66.249.66.90
     +410 197.210.168.174
     +1896 104.196.152.243
     +11005 124.17.34.59
     +
  • - +
  • Seems 124.17.34.59 is really downloading all our PDFs, compared to the next top active IPs during this time!

    # grep -E "07/Nov/2017:1[234]:" /var/log/nginx/access.log | grep 124.17.34.59 | grep -c pdf
     5948
     # grep -E "07/Nov/2017:1[234]:" /var/log/nginx/access.log | grep 104.196.152.243 | grep -c pdf
     0
    -
    +
  • - +
  • About CIAT, I think I need to encourage them to specify a user agent string for their requests, because they are not reusing their Tomcat session and they are creating thousands of sessions per day

  • + +
  • All CIAT requests vs unique ones:

    $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-11-07 | wc -l
     3506
     $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-11-07 | sort | uniq | wc -l
     3506
    -
    +
  • -

    Baidu robots.txt tester

    +
  • But they literally just made this request today:

    180.76.15.136 - - [07/Nov/2017:06:25:11 +0000] "GET /discover?filtertype_0=crpsubject&filter_relational_operator_0=equals&filter_0=WATER%2C+LAND+AND+ECOSYSTEMS&filtertype=subject&filter_relational_operator=equals&filter=WATER+RESOURCES HTTP/1.1" 200 82265 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
    -
    +
  • - +
  • Along with another thousand or so requests to URLs that are forbidden in robots.txt today alone:

    # grep -c Baiduspider /var/log/nginx/access.log
     3806
     # grep Baiduspider /var/log/nginx/access.log | grep -c -E "GET /(browse|discover|search-filter)"
     1085
    -
    +
  • - +
  • I will think about blocking their IPs but they have 164 of them!

    # grep "Baiduspider/2.0" /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq | wc -l
     164
    -
    +
  • +

    2017-11-08

    + +
  • Jesus, the new Chinese IP (124.17.34.59) has downloaded 24,000 PDFs in the last 24 hours:

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "0[78]/Nov/2017:" | grep 124.17.34.59 | grep -v pdf.jpg | grep -c pdf
     24981
    -
    +
  • - +
  • This is about 20,000 Tomcat sessions:

    $ cat dspace.log.2017-11-07 dspace.log.2017-11-08 | grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=124.17.34.59' | sort | uniq | wc -l
     20733
    -
    +
  • - +
  • I’m getting really sick of this

  • + +
  • Sisay re-uploaded the CIAT records that I had already corrected earlier this week, erasing all my corrections

  • + +
  • I had to re-correct all the publishers, places, names, dates, etc and apply the changes on DSpace Test

  • + +
  • Run system updates on DSpace Test and reboot the server

  • + +
  • Magdalena had written to say that two of their Phase II project tags were missing on CGSpace, so I added them (#346)

  • + +
  • I figured out a way to use nginx’s map function to assign a “bot” user agent to misbehaving clients who don’t define a user agent

  • + +
  • Most bots are automatically lumped into one generic session by Tomcat’s Crawler Session Manager Valve but this only works if their user agent matches a pre-defined regular expression like .*[bB]ot.*

  • + +
  • Some clients send thousands of requests without a user agent which ends up creating thousands of Tomcat sessions, wasting precious memory, CPU, and database resources in the process

  • + +
  • Basically, we modify the nginx config to add a mapping with a modified user agent $ua:

    map $remote_addr $ua {
    -    # 2017-11-08 Random Chinese host grabbing 20,000 PDFs
    -    124.17.34.59     'ChineseBot';
    -    default          $http_user_agent;
    +# 2017-11-08 Random Chinese host grabbing 20,000 PDFs
    +124.17.34.59     'ChineseBot';
    +default          $http_user_agent;
     }
    -
    +
  • - +
  • If the client’s address matches then the user agent is set, otherwise the default $http_user_agent variable is used

  • + +
  • Then, in the server’s / block we pass this header to Tomcat:

    proxy_pass http://tomcat_http;
     proxy_set_header User-Agent $ua;
    -
    +
  • - +
  • Note to self: the $ua variable won’t show up in nginx access logs because the default combined log format doesn’t show it, so don’t run around pulling your hair out wondering why the modified user agents aren’t showing in the logs!

  • + +
  • If a client matching one of these IPs connects without a session, it will be assigned one by the Crawler Session Manager Valve

  • + +
  • You can verify by cross referencing nginx’s access.log and DSpace’s dspace.log.2017-11-08, for example
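  • Something along these lines should do it (a sketch reusing the IP from above); once the valve kicks in, the session count should stay tiny even when the request count is huge:

     # grep -c 124.17.34.59 /var/log/nginx/access.log
     # grep 124.17.34.59 dspace.log.2017-11-08 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l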

  • + +
  • I will deploy this on CGSpace later this week

  • + +
  • I am interested to check how this affects the number of sessions used by the CIAT and Chinese bots (see above on 2017-11-07 for example)

  • + +
  • I merged the clickable thumbnails code to 5_x-prod (#347) and will deploy it later along with the new bot mapping stuff (and re-run the Ansible nginx and tomcat tags)

  • + +
  • I was thinking about Baidu again and decided to see how many requests they have versus Google to URL paths that are explicitly forbidden in robots.txt:

    # zgrep Baiduspider /var/log/nginx/access.log* | grep -c -E "GET /(browse|discover|search-filter)"
     22229
     # zgrep Googlebot /var/log/nginx/access.log* | grep -c -E "GET /(browse|discover|search-filter)"
     0
    -
    +
  • - +
  • It seems that they rarely even bother checking robots.txt, but Google does multiple times per day!

    # zgrep Baiduspider /var/log/nginx/access.log* | grep -c robots.txt
     14
     # zgrep Googlebot  /var/log/nginx/access.log* | grep -c robots.txt
     1134
    -
    +
  • -

    2017-11-09

    +
  • Awesome, it seems my bot mapping stuff in nginx actually reduced the number of Tomcat sessions used by the CIAT scraper today, total requests and unique sessions:

    # zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep '09/Nov/2017' | grep -c 104.196.152.243
     8956
     $ grep 104.196.152.243 dspace.log.2017-11-09 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     223
    -
    +
  • - +
  • Versus the same stats for yesterday and the day before:

    # zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep '08/Nov/2017' | grep -c 104.196.152.243 
     10216
    @@ -687,12 +661,13 @@ $ grep 104.196.152.243 dspace.log.2017-11-08 | grep -o -E 'session_id=[A-Z0-9]{3
     8120
     $ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     3506
    -
    +
  • -

    2017-11-11

     @@ -707,66 +682,60 @@ $ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E 'session_id=[A-Z0-9]{3
  • Looking at the top client IPs on CGSpace so far this morning, even though it’s only been eight hours:

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "12/Nov/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    -    243 5.83.120.111
    -    335 40.77.167.103
    -    424 66.249.66.91
    -    529 207.46.13.36
    -    554 40.77.167.129
    -    604 207.46.13.53
    -    754 104.196.152.243
    -    883 66.249.66.90
    -   1150 95.108.181.88
    -   1381 5.9.6.51
    -
     +243 5.83.120.111
     +335 40.77.167.103
     +424 66.249.66.91
     +529 207.46.13.36
     +554 40.77.167.129
     +604 207.46.13.53
     +754 104.196.152.243
     +883 66.249.66.90
     +1150 95.108.181.88
     +1381 5.9.6.51
     +
  • - +
  • 5.9.6.51 seems to be a Russian bot:

    # grep 5.9.6.51 /var/log/nginx/access.log | tail -n 1
     5.9.6.51 - - [12/Nov/2017:08:13:13 +0000] "GET /handle/10568/16515/recent-submissions HTTP/1.1" 200 5097 "-" "Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)"
    -
    +
  • - +
  • What’s amazing is that it seems to reuse its Java session across all requests:

    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2017-11-12
     1558
     $ grep 5.9.6.51 dspace.log.2017-11-12 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     1
    -
    +
  • - +
  • Bravo to MegaIndex.ru!

  • + +
  • The same cannot be said for 95.108.181.88, which appears to be YandexBot, even though Tomcat’s Crawler Session Manager valve regex should match ‘YandexBot’:

    # grep 95.108.181.88 /var/log/nginx/access.log | tail -n 1
     95.108.181.88 - - [12/Nov/2017:08:33:17 +0000] "GET /bitstream/handle/10568/57004/GenebankColombia_23Feb2015.pdf HTTP/1.1" 200 972019 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
     $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2017-11-12
     991
    -
    +
  • - +
  • Move some items and collections on CGSpace for Peter Ballantyne, running move_collections.sh with the following configuration:

    10947/6    10947/1 10568/83389
     10947/34   10947/1 10568/83389
     10947/2512 10947/1 10568/83389
    -
    +
  • - +
  • I explored nginx rate limits as a way to aggressively throttle Baidu bot which doesn’t seem to respect disallowed URLs in robots.txt

  • + +
  • There’s an interesting blog post from Nginx’s team about rate limiting as well as a clever use of mapping with rate limits

  • + +
  • The solution I came up with uses tricks from both of those

  • + +
  • I deployed the limit on CGSpace and DSpace Test and it seems to work well:

    $ http --print h https://cgspace.cgiar.org/handle/10568/1 User-Agent:'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
     HTTP/1.1 200 OK
    @@ -790,28 +759,27 @@ Content-Length: 206
     Content-Type: text/html
     Date: Sun, 12 Nov 2017 16:30:21 GMT
     Server: nginx
    -
    +
  • -
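  • A cheap way to watch the limit kick in is to hammer the same URL in a loop and watch the status lines flip from 200 to 503 (a sketch using httpie as above):

     $ for i in $(seq 1 10); do http --print h https://cgspace.cgiar.org/handle/10568/1 User-Agent:'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)' | head -n 1; done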

    2017-11-13

    +
  • At the end of the day I checked the logs and it really looks like the Baidu rate limiting is working, HTTP 200 vs 503:

    # zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "13/Nov/2017" | grep "Baiduspider" | grep -c " 200 "
     1132
     # zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "13/Nov/2017" | grep "Baiduspider" | grep -c " 503 "
     10105
    -
    +
  • -

    2017-11-14

    @@ -838,17 +808,18 @@ Server: nginx
  • They had been waiting on a branch for a few months and I think I just forgot about them
  • I have been running them on DSpace Test for a few days and haven’t seen any issues there
  • Started testing DSpace 6.2 and a few things have changed
  • Now PostgreSQL needs pgcrypto:

    $ psql dspace6
     dspace6=# CREATE EXTENSION pgcrypto;
    -
    +
  • -

    2017-11-15

    @@ -865,46 +836,47 @@ dspace6=# CREATE EXTENSION pgcrypto;
  • Uptime Robot said that CGSpace went down today and I see lots of Timeout waiting for idle object errors in the DSpace logs
  • I looked in PostgreSQL using SELECT * FROM pg_stat_activity; and saw that there were 73 active connections
  • After a few minutes the connections went down to 44 and CGSpace was kinda back up; it seems like Tsega restarted Tomcat
  • Looking at the REST and XMLUI log files, I don’t see anything too crazy:

    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep "17/Nov/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    -     13 66.249.66.223
    -     14 207.46.13.36
    -     17 207.46.13.137
    -     22 207.46.13.23
    -     23 66.249.66.221
    -     92 66.249.66.219
    -    187 104.196.152.243
    -   1400 70.32.83.92
    -   1503 50.116.102.77
    -   6037 45.5.184.196
    + 13 66.249.66.223
    + 14 207.46.13.36
    + 17 207.46.13.137
    + 22 207.46.13.23
    + 23 66.249.66.221
    + 92 66.249.66.219
    +187 104.196.152.243
    +1400 70.32.83.92
    +1503 50.116.102.77
    +6037 45.5.184.196
     # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "17/Nov/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    -    325 139.162.247.24
    -    354 66.249.66.223
    -    422 207.46.13.36
    -    434 207.46.13.23
    -    501 207.46.13.137
    -    647 66.249.66.221
    -    662 34.192.116.178
    -    762 213.55.99.121
    -   1867 104.196.152.243
    -   2020 66.249.66.219
    -
     +325 139.162.247.24
     +354 66.249.66.223
     +422 207.46.13.36
     +434 207.46.13.23
     +501 207.46.13.137
     +647 66.249.66.221
     +662 34.192.116.178
     +762 213.55.99.121
     +1867 104.196.152.243
     +2020 66.249.66.219
     +
  • - +
  • I need to look into using JMX to analyze active sessions I think, rather than looking at log files

  • + +
  • After adding appropriate JMX listener options to Tomcat’s JAVA_OPTS and restarting Tomcat, I can connect remotely using an SSH dynamic port forward (SOCKS) on port 7777 for example, and then start jconsole locally like:

    $ jconsole -J-DsocksProxyHost=localhost -J-DsocksProxyPort=7777 service:jmx:rmi:///jndi/rmi://localhost:9000/jmxrmi -J-DsocksNonProxyHosts=
    -
    +
  • -
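  • For my own notes, the setup boils down to something like this (a sketch; the exact JMX flags are an assumption, only the ports match what I used above, and “user” is a placeholder):

     JAVA_OPTS="$JAVA_OPTS -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9000 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false"
     $ ssh -D 7777 -N user@cgspace.cgiar.org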

    Jconsole sessions for XMLUI

     @@ -917,51 +889,46 @@ dspace6=# CREATE EXTENSION pgcrypto;
  • Looking in the nginx access logs I see the most active XMLUI users between 4 and 6 AM:

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "19/Nov/2017:0[456]" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    -    111 66.249.66.155
    -    171 5.9.6.51
    -    188 54.162.241.40
    -    229 207.46.13.23
    -    233 207.46.13.137
    -    247 40.77.167.6
    -    251 207.46.13.36
    -    275 68.180.229.254
    -    325 104.196.152.243
    -   1610 66.249.66.153
    -
     +111 66.249.66.155
     +171 5.9.6.51
     +188 54.162.241.40
     +229 207.46.13.23
     +233 207.46.13.137
     +247 40.77.167.6
     +251 207.46.13.36
     +275 68.180.229.254
     +325 104.196.152.243
     +1610 66.249.66.153
     +
  • - +
  • 66.249.66.153 appears to be Googlebot:

    66.249.66.153 - - [19/Nov/2017:06:26:01 +0000] "GET /handle/10568/2203 HTTP/1.1" 200 6309 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    -
    +
  • - +
  • We know Googlebot is persistent but behaves well, so I guess it was just a coincidence that it came at a time when we had other traffic and server activity

  • + +
  • In related news, I see an Atmire update process going for many hours and responsible for hundreds of thousands of log entries (two thirds of all log entries)

    $ wc -l dspace.log.2017-11-19 
     388472 dspace.log.2017-11-19
     $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19 
     267494
    -
    +
  • - +
  • WTF is this process doing every day, and for so many hours?

  • + +
  • In unrelated news, when I was looking at the DSpace logs I saw a bunch of errors like this:

    2017-11-19 03:00:32,806 INFO  org.apache.pdfbox.pdfparser.PDFParser @ Document is encrypted
     2017-11-19 03:00:32,807 ERROR org.apache.pdfbox.filter.FlateFilter @ FlateFilter: stop reading corrupt stream due to a DataFormatException
    -
    +
  • -

    Tomcat G1GC

    @@ -977,35 +944,35 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19

    2017-11-21

    +
  • Magdalena was having problems logging in via LDAP and it seems to be a problem with the CGIAR LDAP server:

    2017-11-21 11:11:09,621 WARN  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=2FEC0E5286C17B6694567FFD77C3171C:ip_addr=77.241.141.58:ldap_authentication:type=failed_auth javax.naming.CommunicationException\colon; simple bind failed\colon; svcgroot2.cgiarad.org\colon;3269 [Root exception is javax.net.ssl.SSLHandshakeException\colon; sun.security.validator.ValidatorException\colon; PKIX path validation failed\colon; java.security.cert.CertPathValidatorException\colon; validity check failed]
    -
    +
  • +

    2017-11-22

    + +
  • The logs don’t show anything particularly abnormal between those hours:

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "22/Nov/2017:0[456]" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    -    136 31.6.77.23
    -    174 68.180.229.254
    -    217 66.249.66.91
    -    256 157.55.39.79
    -    268 54.144.57.183
    -    281 207.46.13.137
    -    282 207.46.13.36
    -    290 207.46.13.23
    -    696 66.249.66.90
    -    707 104.196.152.243
    -
     +136 31.6.77.23
     +174 68.180.229.254
     +217 66.249.66.91
     +256 157.55.39.79
     +268 54.144.57.183
     +281 207.46.13.137
     +282 207.46.13.36
     +290 207.46.13.23
     +696 66.249.66.90
     +707 104.196.152.243
     +
  • -

    Tomcat JVM with CMS GC

     @@ -1014,55 +981,56 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
  • I see a lot of Googlebot (66.249.66.90) in the XMLUI access logs

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "23/Nov/2017:0[456]" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    -     88 66.249.66.91
    -    140 68.180.229.254
    -    155 54.196.2.131
    -    182 54.224.164.166
    -    301 157.55.39.79
    -    315 207.46.13.36
    -    331 207.46.13.23
    -    358 207.46.13.137
    -    565 104.196.152.243
    -   1570 66.249.66.90
    -
     + 88 66.249.66.91
     +140 68.180.229.254
     +155 54.196.2.131
     +182 54.224.164.166
     +301 157.55.39.79
     +315 207.46.13.36
     +331 207.46.13.23
     +358 207.46.13.137
     +565 104.196.152.243
     +1570 66.249.66.90
     +
  • - +
  • … and the usual REST scrapers from CIAT (45.5.184.196) and CCAFS (70.32.83.92):

    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E "23/Nov/2017:0[456]" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    -      5 190.120.6.219
    -      6 104.198.9.108
    -     14 104.196.152.243
    -     21 112.134.150.6
    -     22 157.55.39.79
    -     22 207.46.13.137
    -     23 207.46.13.36
    -     26 207.46.13.23
    -    942 45.5.184.196
    -   3995 70.32.83.92
    -
     + 5 190.120.6.219
     + 6 104.198.9.108
     + 14 104.196.152.243
     + 21 112.134.150.6
     + 22 157.55.39.79
     + 22 207.46.13.137
     + 23 207.46.13.36
     + 26 207.46.13.23
     +942 45.5.184.196
     +3995 70.32.83.92
     +
  • - +
  • These IPs crawling the REST API don’t specify user agents and I’d assume they are creating many Tomcat sessions

  • + +
  • I would catch them in nginx and assign a “bot” user agent to them so that the Tomcat Crawler Session Manager valve could deal with them, but they don’t seem to create many sessions anyway (at least not in the dspace.log):

    $ grep 70.32.83.92 dspace.log.2017-11-23 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     2
    -
    +
  • -

    2017-11-24

    @@ -1083,83 +1051,84 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
  • I just realized that we’re not logging access requests to other vhosts on CGSpace, so it’s possible I have no idea that we’re getting slammed at 4AM on another domain that we’re just silently redirecting to cgspace.cgiar.org
  • I’ve enabled logging on the CGIAR Library on CGSpace so I can check to see if there are many requests there
  • In just a few seconds I already see a dozen requests from Googlebot (of course they get HTTP 301 redirects to cgspace.cgiar.org)
  • I also noticed that CGNET appears to be monitoring the old domain every few minutes:

    192.156.137.184 - - [24/Nov/2017:20:33:58 +0000] "HEAD / HTTP/1.1" 301 0 "-" "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.27.1 zlib/1.2.3 libidn/1.18 libssh2/1.4.2"
    -
    +
  • -

    2017-11-26

    + +
  • Yet another mystery because the load for all domains looks fine at that time:

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "26/Nov/2017:0[567]" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    -    190 66.249.66.83
    -    195 104.196.152.243
    -    220 40.77.167.82
    -    246 207.46.13.137
    -    247 68.180.229.254
    -    257 157.55.39.214
    -    289 66.249.66.91
    -    298 157.55.39.206
    -    379 66.249.66.70
    -   1855 66.249.66.90
    -
     +190 66.249.66.83
     +195 104.196.152.243
     +220 40.77.167.82
     +246 207.46.13.137
     +247 68.180.229.254
     +257 157.55.39.214
     +289 66.249.66.91
     +298 157.55.39.206
     +379 66.249.66.70
     +1855 66.249.66.90
     +
  • +

    2017-11-29

    + +
  • Here are all the top XMLUI and REST users from today:

    # cat /var/log/nginx/rest.log  /var/log/nginx/rest.log.1  /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "29/Nov/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    -    540 66.249.66.83
    -    659 40.77.167.36
    -    663 157.55.39.214
    -    681 157.55.39.206
    -    733 157.55.39.158
    -    850 66.249.66.70
    -   1311 66.249.66.90
    -   1340 104.196.152.243
    -   4008 70.32.83.92
    -   6053 45.5.184.196
    -
     +540 66.249.66.83
     +659 40.77.167.36
     +663 157.55.39.214
     +681 157.55.39.206
     +733 157.55.39.158
     +850 66.249.66.70
     +1311 66.249.66.90
     +1340 104.196.152.243
     +4008 70.32.83.92
     +6053 45.5.184.196
     +
  • - +
  • PostgreSQL activity shows 69 connections

  • + +
  • I don’t have time to troubleshoot more as I’m in Nairobi working on the HPC so I just restarted Tomcat for now

  • + +
  • A few hours later Uptime Robot says the server is down again

  • + +
  • I don’t see much activity in the logs but there are 87 PostgreSQL connections

  • + +
  • But shit, there were 10,000 unique Tomcat sessions today:

    $ cat dspace.log.2017-11-29 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     10037
  • Although maybe that’s not much, as the previous two days had more:

    $ cat dspace.log.2017-11-27 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     12377
     $ cat dspace.log.2017-11-28 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     16984

    2017-11-30

    2017-12-01

  • The number of DSpace sessions isn’t even that high:

    $ cat /home/cgspace.cgiar.org/log/dspace.log.2017-12-01 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     5815
  • Connections in the last two hours:

    # cat /var/log/nginx/rest.log  /var/log/nginx/rest.log.1  /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "1/Dec/2017:(09|10)" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail                                                      
         78 93.160.60.22
        101 40.77.167.122
        113 66.249.66.70
        129 157.55.39.206
        130 157.55.39.235
        135 40.77.167.58
        164 68.180.229.254
        177 87.100.118.220
        188 66.249.66.90
        314 2.86.122.76
  • What the fuck is going on?

  • + +
  • I’ve never seen this 2.86.122.76 before, it has made quite a few unique Tomcat sessions today:

    $ grep 2.86.122.76 /home/cgspace.cgiar.org/log/dspace.log.2017-12-01 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     822
  • Appears to be some new bot:

    2.86.122.76 - - [01/Dec/2017:09:02:53 +0000] "GET /handle/10568/78444?show=full HTTP/1.1" 200 29307 "-" "Mozilla/3.0 (compatible; Indy Library)"
  • I restarted Tomcat and everything came back up

  • I can add Indy Library to the Tomcat crawler session manager valve but it would be nice if I could simply remap the useragent in nginx

  • I will also add ‘Drupal’ to the Tomcat crawler session manager valve because there are Drupals out there harvesting and they should be considered as bots (see the valve sketch below)

    # cat /var/log/nginx/rest.log  /var/log/nginx/rest.log.1  /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "1/Dec/2017" | grep Drupal | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
          3 54.75.205.145
          6 70.32.83.92
         14 2a01:7e00::f03c:91ff:fe18:7396
         46 2001:4b99:1:1:216:3eff:fe2c:dc6c
        319 2001:4b99:1:1:216:3eff:fe76:205b

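  • For reference, a minimal sketch of what adding those agents to the Tomcat Crawler Session Manager valve in server.xml might look like (the regex below is Tomcat's default crawlerUserAgents pattern with Drupal and Indy Library appended; our real pattern may differ):

    <!-- Share one Tomcat session per crawler instead of one per request -->
    <Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
           crawlerUserAgents=".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*Drupal.*|.*Indy Library.*" />
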
    2017-12-03

  • Uptime Robot alerted that the server went down and up around 8:53 this morning
  • Uptime Robot alerted that CGSpace was down and up again a few minutes later
  • I don’t see any errors in the DSpace logs but I see in nginx’s access.log that UptimeRobot was returned with HTTP 499 status (Client Closed Request)
  • Looking at the REST API logs I see some new client IP I haven’t noticed before:

    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E "6/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         18 95.108.181.88
         19 68.180.229.254
         30 207.46.13.151
         33 207.46.13.110
         38 40.77.167.20
         41 157.55.39.223
         82 104.196.152.243
       1529 50.116.102.77
       4005 70.32.83.92
       6045 45.5.184.196

    2017-12-07

  • At one point Tsega restarted Tomcat
  • I never got any alerts about high load from Linode though…
  • I looked just now and see that there are 121 PostgreSQL connections!
  • The top users right now are:

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "7/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail 
        838 40.77.167.11
        939 66.249.66.223
       1149 66.249.66.206
       1316 207.46.13.110
       1322 207.46.13.151
       1323 2001:da8:203:2224:c912:1106:d94f:9189
       1414 157.55.39.223
       2378 104.196.152.243
       2662 66.249.66.219
       5110 124.17.34.60
  • We’ve never seen 124.17.34.60 yet, but it’s really hammering us!

  • Apparently it is from China, and here is one of its user agents:

    Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.2; Win64; x64; Trident/7.0; LCTE)
  • It is responsible for 4,500 Tomcat sessions today alone:

    $ grep 124.17.34.60 /home/cgspace.cgiar.org/log/dspace.log.2017-12-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     4574
  • I’ve adjusted the nginx IP mapping that I set up last month to account for 124.17.34.60 and 124.17.34.59 using a regex, as it’s the same bot on the same subnet

  • I was running the DSpace cleanup task manually and it hit an error:

    $ /home/cgspace.cgiar.org/bin/dspace cleanup -v
     ...
     Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
       Detail: Key (bitstream_id)=(144666) is still referenced from table "bundle".
  • The solution, as I discovered in 2017-04, is to set the primary_bitstream_id to null:

    dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (144666);
     UPDATE 1

    2017-12-13

  • Something weird going on with duplicate authors that have the same text value, like Berto, Jayson C. and Balmeo, Katherine P.
  • I will send her feedback on some author names like UNEP and ICRISAT and ask her for the missing thumbnail11.jpg
  • I did a test import of the data locally after building with SAFBuilder but for some reason I had to specify the collection (even though the collections were specified in the collection field)

    $ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" ~/dspace/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/89338 --source /Users/aorth/Downloads/2016\ bulk\ upload\ thumbnails/SimpleArchiveFormat --mapfile=/tmp/ccafs.map &> /tmp/ccafs.log
  • It’s the same on DSpace Test, I can’t import the SAF bundle without specifying the collection:

    $ dspace import --add --eperson=aorth@mjanja.ch --mapfile=/tmp/ccafs.map --source=/tmp/ccafs-2016/SimpleArchiveFormat
     No collections given. Assuming 'collections' file inside item directory
    @@ -355,155 +342,150 @@ Generating mapfile: /tmp/ccafs.map
     Processing collections file: collections
     Adding item from directory item_1
     java.lang.NullPointerException
        at org.dspace.app.itemimport.ItemImport.addItem(ItemImport.java:865)
        at org.dspace.app.itemimport.ItemImport.addItems(ItemImport.java:736)
        at org.dspace.app.itemimport.ItemImport.main(ItemImport.java:498)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
     java.lang.NullPointerException
     Started: 1513521856014
     Ended: 1513521858573
     Elapsed time: 2 secs (2559 msecs)
  • I even tried to debug it by adding verbose logging to the JAVA_OPTS:

    -Dlog4j.configuration=file:/Users/aorth/dspace/config/log4j-console.properties -Ddspace.log.init.disable=true
  • … but the error message was the same, just with more INFO noise around it

  • For now I’ll import into a collection in DSpace Test but I’m really not sure what’s up with this!

  • Linode alerted that CGSpace was using high CPU from 4 to 6 PM

  • The logs for today show the CORE bot (137.108.70.7) being active in XMLUI:

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "17/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
        671 66.249.66.70
        885 95.108.181.88
        904 157.55.39.96
        923 157.55.39.179
       1159 207.46.13.107
       1184 104.196.152.243
       1230 66.249.66.91
       1414 68.180.229.254
       4137 66.249.66.90
      46401 137.108.70.7
  • And then some CIAT bot (45.5.184.196) is actively hitting API endpoints:

    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "17/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         33 68.180.229.254
         48 157.55.39.96
         51 157.55.39.179
         56 207.46.13.107
        102 104.196.152.243
        102 66.249.66.90
        691 137.108.70.7
       1531 50.116.102.77
       4014 70.32.83.92
      11030 45.5.184.196

    2017-12-18

  • The XMLUI logs show that the CORE bot from last night (137.108.70.7) is very active still:

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "18/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
        190 207.46.13.146
        191 197.210.168.174
        202 86.101.203.216
        268 157.55.39.134
        297 66.249.66.91
        314 213.55.99.121
        402 66.249.66.90
        532 68.180.229.254
        644 104.196.152.243
      32220 137.108.70.7
  • On the API side (REST and OAI) there is still the same CIAT bot (45.5.184.196) from last night making quite a number of requests this morning:

    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "18/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
          7 104.198.9.108
          8 185.29.8.111
          8 40.77.167.176
          9 66.249.66.91
          9 68.180.229.254
         10 157.55.39.134
         15 66.249.66.90
         59 104.196.152.243
       4014 70.32.83.92
       8619 45.5.184.196
  • I need to keep an eye on this issue because it has nice fixes for reducing the number of database connections in DSpace 5.7: https://jira.duraspace.org/browse/DS-3551

  • Update text on CGSpace about page to give some tips to developers about using the resources more wisely (#352)

  • Linode alerted that CGSpace was using 396.3% CPU from 12 to 2 PM

  • The REST and OAI API logs look pretty much the same as earlier this morning, but there’s a new IP harvesting XMLUI:

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "18/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail            
        360 95.108.181.88
        477 66.249.66.90
        526 86.101.203.216
        691 207.46.13.13
        698 197.210.168.174
        819 207.46.13.146
        878 68.180.229.254
       1965 104.196.152.243
      17701 2.86.72.181
      52532 137.108.70.7
  • 2.86.72.181 appears to be from Greece, and has the following user agent:

    Mozilla/3.0 (compatible; Indy Library)
  • Surprisingly it seems they are re-using their Tomcat session for all those 17,000 requests:

    $ grep 2.86.72.181 dspace.log.2017-12-18 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l                                                                                          
     1
  • I guess there’s nothing I can do to them for now

  • In other news, I am curious how many PostgreSQL connection pool errors we’ve had in the last month:

    $ grep -c "Cannot get a connection, pool error Timeout waiting for idle object" dspace.log.2017-1* | grep -v :0
     dspace.log.2017-11-07:15695
     dspace.log.2017-11-29:3972
     dspace.log.2017-12-01:1601
     dspace.log.2017-12-02:1274
     dspace.log.2017-12-07:2769
  • I made a small fix to my move-collections.sh script so that it handles the case when a “to” or “from” community doesn’t exist

  • The script lives here: https://gist.github.com/alanorth/e60b530ed4989df0c731afbb0c640515

  • Major reorganization of four of CTA’s French collections

  • Basically moving their items into the English ones, then moving the English ones to the top-level of the CTA community, and deleting the old sub-communities

  • Move collection 10568/51821 from 10568/42212 to 10568/42211

  • Move collection 10568/51400 from 10568/42214 to 10568/42211

  • Move collection 10568/56992 from 10568/42216 to 10568/42211

  • Move collection 10568/42218 from 10568/42217 to 10568/42211

  • Export CSV of collection 10568/63484 and move items to collection 10568/51400

  • Export CSV of collection 10568/64403 and move items to collection 10568/56992

  • Export CSV of collection 10568/56994 and move items to collection 10568/42218

  • There are blank lines in this metadata, which causes DSpace to not detect changes in the CSVs

  • I had to use OpenRefine to remove all columns from the CSV except id and collection, and then update the collection field for the new mappings (a command-line sketch follows below)

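  • For reference, a rough sketch of doing the same export and column trimming on the command line instead of in OpenRefine, using DSpace’s metadata-export and csvkit’s csvcut (the handle is one of the collections above; the paths are illustrative and I have not verified this exact invocation):

    $ dspace metadata-export -i 10568/63484 -f /tmp/63484.csv
    $ csvcut -c id,collection /tmp/63484.csv > /tmp/63484-mapping.csv
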
  • Remove empty sub-communities: 10568/42212, 10568/42214, 10568/42216, 10568/42217

  • I was in the middle of applying the metadata imports on CGSpace and the system ran out of PostgreSQL connections…

  • There were 128 PostgreSQL connections at the time… grrrr.

  • So I restarted Tomcat 7 and restarted the imports

  • I assume the PostgreSQL transactions were fine but I will remove the Discovery index for their community and re-run the light-weight indexing to hopefully re-construct everything:

    $ dspace index-discovery -r 10568/42211
     $ schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery

    2017-12-19

  • I notice that this number will be set to 10 by default in DSpace 6.1 and 7.0: https://jira.duraspace.org/browse/DS-3564
  • So I’m going to reduce ours from 20 to 10 and start trying to figure out how the hell to supply a database pool using Tomcat JNDI
  • I re-deployed the 5_x-prod branch on CGSpace, applied all system updates, and restarted the server
  • Looking through the dspace.log I see this error:

    2017-12-19 08:17:15,740 ERROR org.dspace.statistics.SolrLogger @ Error CREATEing SolrCore 'statistics-2010': Unable to create core [statistics-2010] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2010/data/index/write.lock
  • I don’t have time now to look into this but the Solr sharding has long been an issue!

  • Looking into using JDBC / JNDI to provide a database pool to DSpace

  • The DSpace 6.x configuration docs have more notes about setting up the database pool than the 5.x ones (which actually have none!)

  • First, I uncomment db.jndi in dspace/config/dspace.cfg

  • Then I create a global Resource in the main Tomcat server.xml (inside GlobalNamingResources):

    <Resource name="jdbc/dspace" auth="Container" type="javax.sql.DataSource"
     	  driverClassName="org.postgresql.Driver"
     	  url="jdbc:postgresql://localhost:5432/dspace"
     	  username="dspace"
     	  password="dspace"
          initialSize='5'
          maxActive='50'
          maxIdle='15'
          minIdle='5'
          maxWait='5000'
          validationQuery='SELECT 1'
          testOnBorrow='true' />
  • Most of the parameters are from comments by Mark Wood about his JNDI setup: https://jira.duraspace.org/browse/DS-3564

  • Then I add a ResourceLink to each web application context:

    <ResourceLink global="jdbc/dspace" name="jdbc/dspace" type="javax.sql.DataSource"/>
  • I am not sure why several guides show configuration snippets for server.xml and web application contexts that use a Local and Global jdbc…

  • When DSpace can’t find the JNDI context (for whatever reason) you will see this in the dspace logs:

    2017-12-19 13:12:08,796 ERROR org.dspace.storage.rdbms.DatabaseManager @ Error retrieving JNDI context: jdbc/dspace
     javax.naming.NameNotFoundException: Name [jdbc/dspace] is not bound in this Context. Unable to find [jdbc].
        at org.apache.naming.NamingContext.lookup(NamingContext.java:825)
        at org.apache.naming.NamingContext.lookup(NamingContext.java:173)
        at org.dspace.storage.rdbms.DatabaseManager.initDataSource(DatabaseManager.java:1414)
        at org.dspace.storage.rdbms.DatabaseManager.initialize(DatabaseManager.java:1331)
        at org.dspace.storage.rdbms.DatabaseManager.getDataSource(DatabaseManager.java:648)
        at org.dspace.storage.rdbms.DatabaseManager.getConnection(DatabaseManager.java:627)
        at org.dspace.core.Context.init(Context.java:121)
        at org.dspace.core.Context.<init>(Context.java:95)
        at org.dspace.app.util.AbstractDSpaceWebapp.register(AbstractDSpaceWebapp.java:79)
        at org.dspace.app.util.DSpaceContextListener.contextInitialized(DSpaceContextListener.java:128)
        at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:5110)
        at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5633)
        at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:145)
        at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:1015)
        at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:991)
        at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:652)
        at org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:712)
        at org.apache.catalina.startup.HostConfig$DeployDescriptor.run(HostConfig.java:2002)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
     2017-12-19 13:12:08,798 INFO  org.dspace.storage.rdbms.DatabaseManager @ Unable to locate JNDI dataSource: jdbc/dspace
     2017-12-19 13:12:08,798 INFO  org.dspace.storage.rdbms.DatabaseManager @ Falling back to creating own Database pool
  • And indeed the Catalina logs show that it failed to set up the JDBC driver:

    org.apache.tomcat.dbcp.dbcp.SQLNestedException: Cannot load JDBC driver class 'org.postgresql.Driver'
  • There are several copies of the PostgreSQL driver installed by DSpace:

    $ find ~/dspace/ -iname "postgresql*jdbc*.jar"
     /Users/aorth/dspace/webapps/jspui/WEB-INF/lib/postgresql-9.1-901-1.jdbc4.jar
     /Users/aorth/dspace/webapps/xmlui/WEB-INF/lib/postgresql-9.1-901-1.jdbc4.jar
     /Users/aorth/dspace/webapps/rest/WEB-INF/lib/postgresql-9.1-901-1.jdbc4.jar
     /Users/aorth/dspace/lib/postgresql-9.1-901-1.jdbc4.jar
  • These apparently come from the main DSpace pom.xml:

    <dependency>
       <groupId>postgresql</groupId>
       <artifactId>postgresql</artifactId>
       <version>9.1-901-1.jdbc4</version>
     </dependency>
  • So WTF? Let’s try copying one to Tomcat’s lib folder and restarting Tomcat:

    $ cp ~/dspace/lib/postgresql-9.1-901-1.jdbc4.jar /usr/local/opt/tomcat@7/libexec/lib
  • Oh that’s fantastic, now at least Tomcat doesn’t print an error during startup so I guess it succeeds in creating the JNDI pool

  • DSpace starts up but I have no idea if it’s using the JNDI configuration because I see this in the logs:

    2017-12-19 13:26:54,271 INFO  org.dspace.storage.rdbms.DatabaseManager @ DBMS is '{}'PostgreSQL
     2017-12-19 13:26:54,277 INFO  org.dspace.storage.rdbms.DatabaseManager @ DBMS driver version is '{}'9.5.10
     2017-12-19 13:26:54,293 INFO  org.dspace.storage.rdbms.DatabaseUtils @ Loading Flyway DB migrations from: filesystem:/Users/aorth/dspace/etc/postgres, classpath:org.dspace.storage.rdbms.sqlmigration.postgres, classpath:org.dspace.storage.rdbms.migration
     2017-12-19 13:26:54,306 INFO  org.flywaydb.core.internal.dbsupport.DbSupportFactory @ Database: jdbc:postgresql://localhost:5432/dspacetest (PostgreSQL 9.5)
  • After adding the Resource to server.xml on Ubuntu I get this in Catalina’s logs:

    SEVERE: Unable to create initial connections of pool.
     java.sql.SQLException: org.postgresql.Driver
     ...
     Caused by: java.lang.ClassNotFoundException: org.postgresql.Driver
  • The username and password are correct, but maybe I need to copy the fucking lib there too?

  • I tried installing Ubuntu’s libpostgresql-jdbc-java package but Tomcat still can’t find the class

  • Let me try to symlink the lib into Tomcat’s libs:

    # ln -sv /usr/share/java/postgresql.jar /usr/share/tomcat7/lib
  • Now Tomcat starts but the localhost container has errors:

    SEVERE: Exception sending context initialized event to listener instance of class org.dspace.app.util.DSpaceContextListener
     java.lang.AbstractMethodError: Method org/postgresql/jdbc3/Jdbc3ResultSet.isClosed()Z is abstract
  • Could be a version issue or something since the Ubuntu package provides 9.2 and DSpace’s are 9.1…

  • Let me try to remove it and copy in DSpace’s:

    # rm /usr/share/tomcat7/lib/postgresql.jar
     # cp [dspace]/webapps/xmlui/WEB-INF/lib/postgresql-9.1-901-1.jdbc4.jar /usr/share/tomcat7/lib/
  • Wow, I think that actually works…

  • I wonder if I could get the JDBC driver from postgresql.org instead of relying on the one from the DSpace build: https://jdbc.postgresql.org/

  • I notice our version is 9.1-901, which isn’t even available anymore! The latest in the archived versions is 9.1-903

  • Also, since I commented out all the db parameters in DSpace.cfg, how does the command line dspace tool work?

  • Let’s try the upstream JDBC driver first:

    # rm /usr/share/tomcat7/lib/postgresql-9.1-901-1.jdbc4.jar
     # wget https://jdbc.postgresql.org/download/postgresql-42.1.4.jar -O /usr/share/tomcat7/lib/postgresql-42.1.4.jar
  • DSpace command line fails unless db settings are present in dspace.cfg:

    $ dspace database info
     Caught exception:
     java.sql.SQLException: java.lang.ClassNotFoundException: 
        at org.dspace.storage.rdbms.DataSourceInit.getDatasource(DataSourceInit.java:171)
        at org.dspace.storage.rdbms.DatabaseManager.initDataSource(DatabaseManager.java:1438)
        at org.dspace.storage.rdbms.DatabaseUtils.main(DatabaseUtils.java:81)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
     Caused by: java.lang.ClassNotFoundException: 
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:264)
        at org.dspace.storage.rdbms.DataSourceInit.getDatasource(DataSourceInit.java:41)
        ... 8 more
  • And in the logs:

    2017-12-19 18:26:56,971 ERROR org.dspace.storage.rdbms.DatabaseManager @ Error retrieving JNDI context: jdbc/dspace
     javax.naming.NoInitialContextException: Need to specify class name in environment or system property, or as an applet parameter, or in an application resource file:  java.naming.factory.initial
        at javax.naming.spi.NamingManager.getInitialContext(NamingManager.java:662)
        at javax.naming.InitialContext.getDefaultInitCtx(InitialContext.java:313)
        at javax.naming.InitialContext.getURLOrDefaultInitCtx(InitialContext.java:350)
        at javax.naming.InitialContext.lookup(InitialContext.java:417)
        at org.dspace.storage.rdbms.DatabaseManager.initDataSource(DatabaseManager.java:1413)
        at org.dspace.storage.rdbms.DatabaseUtils.main(DatabaseUtils.java:81)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
     2017-12-19 18:26:56,983 INFO  org.dspace.storage.rdbms.DatabaseManager @ Unable to locate JNDI dataSource: jdbc/dspace
     2017-12-19 18:26:56,983 INFO  org.dspace.storage.rdbms.DatabaseManager @ Falling back to creating own Database pool
     2017-12-19 18:26:56,992 WARN  org.dspace.core.ConfigurationManager @ Warning: Number format error in property: db.maxconnections
     2017-12-19 18:26:56,992 WARN  org.dspace.core.ConfigurationManager @ Warning: Number format error in property: db.maxwait
     2017-12-19 18:26:56,993 WARN  org.dspace.core.ConfigurationManager @ Warning: Number format error in property: db.maxidle

    2017-12-20

  • Test and import 13 records to CGSpace for Abenet:

    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
     $ dspace import -a -e aorth@mjanja.ch -s /home/aorth/cg_system_20Dec/SimpleArchiveFormat -m systemoffice.map &> systemoffice.log
  • The fucking database went from 47 to 72 to 121 connections while I was importing so it stalled.

  • Since I had to restart Tomcat anyways, I decided to just deploy the new JNDI connection pooling stuff on CGSpace

  • There was an initial connection storm of 50 PostgreSQL connections, but then it settled down to 7

  • After that CGSpace came up fine and I was able to import the 13 items just fine:

    $ dspace import -a -e aorth@mjanja.ch -s /home/aorth/cg_system_20Dec/SimpleArchiveFormat -m systemoffice.map &> systemoffice.log
     $ schedtool -D -e ionice -c2 -n7 nice -n19 dspace filter-media -i 10568/89287

    2017-12-24

  • I’m playing with reading all of a month’s nginx logs into goaccess:

    # find /var/log/nginx -type f -newermt "2017-12-01" | xargs zcat --force | goaccess --log-format=COMBINED -

    2018-01-02

  • And just before that I see this:

    Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
  • Ah hah! So the pool was actually empty!

  • I need to increase that, let’s try to bump it up from 50 to 75

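  • Concretely, assuming the pool in question is the jdbc/dspace JNDI Resource I set up in server.xml back in December, that would mean raising its maxActive and restarting Tomcat; everything except maxActive is unchanged from that earlier snippet:

    <!-- same Resource as before, with the pool ceiling raised from 50 to 75 -->
    <Resource name="jdbc/dspace" auth="Container" type="javax.sql.DataSource"
              driverClassName="org.postgresql.Driver"
              url="jdbc:postgresql://localhost:5432/dspace"
              username="dspace"
              password="dspace"
              initialSize='5'
              maxActive='75'
              maxIdle='15'
              minIdle='5'
              maxWait='5000'
              validationQuery='SELECT 1'
              testOnBorrow='true' />
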
  • After that one client got an HTTP 499 but then the rest were HTTP 200, so I don’t know what the hell Uptime Robot saw

  • I notice this error quite a few times in dspace.log:

    2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
     org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered " "]" "] "" at line 1, column 32.
  • And there are many of these errors every day for the past month:

    $ grep -c "Error while searching for sidebar facets" dspace.log.*
     dspace.log.2017-11-21:4
     dspace.log.2017-12-30:89
     dspace.log.2017-12-31:53
     dspace.log.2018-01-01:45
     dspace.log.2018-01-02:34

    2018-01-03

  • Looks like I need to increase the database pool size again:

    $ grep -c "Timeout: Pool empty." dspace.log.2018-01-*
     dspace.log.2018-01-01:0
     dspace.log.2018-01-02:1972
     dspace.log.2018-01-03:1909

    CGSpace PostgreSQL connections

  • The active IPs in XMLUI are:

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "3/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
        607 40.77.167.141
        611 2a00:23c3:8c94:7800:392c:a491:e796:9c50
        663 188.226.169.37
        759 157.55.39.245
        887 68.180.229.254
       1037 157.55.39.175
       1068 216.244.66.245
       1495 66.249.64.91
       1934 104.196.152.243
       2219 134.155.96.78
  • 134.155.96.78 appears to be at the University of Mannheim in Germany

  • They identify as: Mozilla/5.0 (compatible; heritrix/3.2.0 +http://ifm.uni-mannheim.de)

  • This appears to be the Internet Archive’s open source bot

  • They seem to be re-using their Tomcat session so I don’t need to do anything to them just yet:

    $ grep 134.155.96.78 dspace.log.2018-01-03 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     2
  • The API logs show the normal users:

    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "3/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         32 207.46.13.182
         38 40.77.167.132
         38 68.180.229.254
         43 66.249.64.91
         46 40.77.167.141
         49 157.55.39.245
         79 157.55.39.175
       1533 50.116.102.77
       4069 70.32.83.92
       9355 45.5.184.196
  • In other related news I see a sizeable amount of requests coming from python-requests

  • For example, just in the last day there were 1700!

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -c python-requests
     1773
  • But they come from hundreds of IPs, many of which are 54.x.x.x:

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep python-requests | awk '{print $1}' | sort -n | uniq -c | sort -h | tail -n 30
          9 54.144.87.92
          9 54.146.222.143
          9 54.146.249.249
          9 54.158.139.206
          9 54.161.235.224
          9 54.163.41.19
          9 54.163.4.51
          9 54.196.195.107
          9 54.198.89.134
          9 54.80.158.113
         10 54.198.171.98
         10 54.224.53.185
         10 54.226.55.207
         10 54.227.8.195
         10 54.242.234.189
         10 54.242.238.209
         10 54.80.100.66
         11 54.161.243.121
         11 54.205.154.178
         11 54.234.225.84
         11 54.87.23.173
         11 54.90.206.30
         12 54.196.127.62
         12 54.224.242.208
         12 54.226.199.163
         13 54.162.149.249
         13 54.211.182.255
         19 50.17.61.150
         21 54.211.119.107
        139 164.39.7.62

    2018-01-04

  • The XMLUI logs show this activity:

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "4/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
        968 197.211.63.81
        981 213.55.99.121
       1039 66.249.64.93
       1258 157.55.39.175
       1273 207.46.13.182
       1311 157.55.39.191
       1319 157.55.39.197
       1775 66.249.64.78
       2216 104.196.152.243
       3366 66.249.64.91
  • Again we ran out of PostgreSQL database connections, even after bumping the pool max active limit from 50 to 75 to 125 yesterday!

    2018-01-04 07:36:08,089 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
     org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-256] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:125; busy:125; idle:0; lastwait:5000].
  • So for this week that is the number one problem!

    $ grep -c "Timeout: Pool empty." dspace.log.2018-01-*
     dspace.log.2018-01-01:0
     dspace.log.2018-01-02:1972
     dspace.log.2018-01-03:1909
     dspace.log.2018-01-04:1559

    2018-01-05

  • I don’t see any alerts from Linode or UptimeRobot, and there are no PostgreSQL connection errors in the dspace logs for today:

    $ grep -c "Timeout: Pool empty." dspace.log.2018-01-*
     dspace.log.2018-01-01:0
     dspace.log.2018-01-02:1972
     dspace.log.2018-01-03:1909
     dspace.log.2018-01-04:1559
     dspace.log.2018-01-05:0
  • Daniel asked for help with their DAGRIS server (linode2328112) that has no disk space

  • I had a look and there is one Apache 2 log file that is 73GB, with lots of this:

    [Fri Jan 05 09:31:22.965398 2018] [:error] [pid 9340] [client 213.55.99.121:64476] WARNING: Unable to find a match for "9-16-1-RV.doc" in "/home/files/journals/6//articles/9/". Skipping this file., referer: http://dagris.info/reviewtool/index.php/index/install/upgrade
  • I will delete the log file for now and tell Danny

  • Also, I’m still seeing a hundred or so of the “ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer” errors in dspace logs, I need to search the dspace-tech mailing list to see what the cause is

  • I will run a full Discovery reindex in the meantime to see if it’s something wrong with the Discovery Solr core

    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
     $ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
     real    110m43.985s
     user    15m24.960s
     sys     3m14.890s

    2018-01-06

  • I’m still seeing Solr errors in the DSpace logs even after the full reindex yesterday:

    org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1983+TO+1989]': Encountered " "]" "] "" at line 1, column 32.

    2018-01-09

  • Generate a list of author affiliations for Peter to clean up:

    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
     COPY 4515

    2018-01-10

  • I looked to see what happened to this year’s Solr statistics sharding task that should have run on 2018-01-01 and of course it failed:

    Moving: 81742 into core statistics-2010
     Exception: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2010
     org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2010
        at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:566)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
        at org.dspace.statistics.SolrLogger.shardSolrIndex(SourceFile:2243)
        at org.dspace.statistics.util.StatisticsClient.main(StatisticsClient.java:106)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
     Caused by: org.apache.http.client.ClientProtocolException
        at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:867)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:448)
        ... 10 more
     Caused by: org.apache.http.client.NonRepeatableRequestException: Cannot retry request with a non-repeatable request entity.  The cause lists the reason the original request failed.
        at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:659)
        at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487)
        at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
        ... 14 more
     Caused by: java.net.SocketException: Connection reset
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:115)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
        at org.apache.http.impl.io.AbstractSessionOutputBuffer.flushBuffer(AbstractSessionOutputBuffer.java:159)
        at org.apache.http.impl.io.AbstractSessionOutputBuffer.write(AbstractSessionOutputBuffer.java:179)
        at org.apache.http.impl.io.ChunkedOutputStream.flushCacheWithAppend(ChunkedOutputStream.java:124)
        at org.apache.http.impl.io.ChunkedOutputStream.write(ChunkedOutputStream.java:181)
        at org.apache.http.entity.InputStreamEntity.writeTo(InputStreamEntity.java:132)
        at org.apache.http.entity.HttpEntityWrapper.writeTo(HttpEntityWrapper.java:89)
        at org.apache.http.impl.client.EntityEnclosingRequestWrapper$EntityWrapper.writeTo(EntityEnclosingRequestWrapper.java:108)
        at org.apache.http.impl.entity.EntitySerializer.serialize(EntitySerializer.java:117)
        at org.apache.http.impl.AbstractHttpClientConnection.sendRequestEntity(AbstractHttpClientConnection.java:265)
        at org.apache.http.impl.conn.ManagedClientConnectionImpl.sendRequestEntity(ManagedClientConnectionImpl.java:203)
        at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:236)
        at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:121)
        at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:685)
        ... 16 more
  • DSpace Test has the same error but with creating the 2017 core:

    Moving: 2243021 into core statistics-2017
     Exception: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2017
     org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2017
        at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:566)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
        at org.dspace.statistics.SolrLogger.shardSolrIndex(SourceFile:2243)
        at org.dspace.statistics.util.StatisticsClient.main(StatisticsClient.java:106)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
     Caused by: org.apache.http.client.ClientProtocolException
        at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:867)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:448)
        ... 10 more
  • There is interesting documentation about this on the DSpace Wiki: https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics+Maintenance#SOLRStatisticsMaintenance-SolrShardingByYear

  • I’m looking to see maybe if we’re hitting the issues mentioned in DS-2212 that were apparently fixed in DSpace 5.2

  • I can apparently search for records in the Solr stats core that have an empty owningColl field using this in the Solr admin query: -owningColl:*

  • On CGSpace I see 48,000,000 records that have an owningColl field and 34,000,000 that don’t:

    $ http 'http://localhost:3000/solr/statistics/select?q=owningColl%3A*&wt=json&indent=true' | grep numFound 
    -  "response":{"numFound":48476327,"start":0,"docs":[
    +"response":{"numFound":48476327,"start":0,"docs":[
     $ http 'http://localhost:3000/solr/statistics/select?q=-owningColl%3A*&wt=json&indent=true' | grep numFound
    -  "response":{"numFound":34879872,"start":0,"docs":[
    -
    +"response":{"numFound":34879872,"start":0,"docs":[ +
  • - +
  • I tested the dspace stats-util -s process on my local machine and it failed the same way

  • + +
  • It doesn’t seem to be helpful, but the dspace log shows this:

    2018-01-10 10:51:19,301 INFO  org.dspace.statistics.SolrLogger @ Created core with name: statistics-2016
     2018-01-10 10:51:19,301 INFO  org.dspace.statistics.SolrLogger @ Moving: 3821 records into core statistics-2016
    -
    +
  • - +
  • Terry Brady has written some notes on the DSpace Wiki about Solr sharding issues: https://wiki.duraspace.org/display/%7Eterrywbrady/Statistics+Import+Export+Issues

  • + +
  • Uptime Robot said that CGSpace went down at around 9:43 AM

  • + +
  • I looked at PostgreSQL’s pg_stat_activity table and saw 161 active connections, but no pool errors in the DSpace logs (a query to break those down by state is sketched after this check):

    $ grep -c "Timeout: Pool empty." dspace.log.2018-01-10 
     0
    -
    +
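
  • For the record, a quick way to break those connections down by state instead of scanning the whole table (a sketch; it assumes psql connects straight to the DSpace database):

    $ psql -c 'SELECT state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY count(*) DESC;'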
  • - +
  • The XMLUI logs show quite a bit of activity today:

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep "10/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    -    951 207.46.13.159
    -    954 157.55.39.123
    -   1217 95.108.181.88
    -   1503 104.196.152.243
    -   6455 70.36.107.50
    -  11412 70.36.107.190
    -  16730 70.36.107.49
    -  17386 2607:fa98:40:9:26b6:fdff:feff:1c96
    -  21566 2607:fa98:40:9:26b6:fdff:feff:195d
    -  45384 2607:fa98:40:9:26b6:fdff:feff:1888
    -
    +951 207.46.13.159
    +954 157.55.39.123
    +1217 95.108.181.88
    +1503 104.196.152.243
    +6455 70.36.107.50
    +11412 70.36.107.190
    +16730 70.36.107.49
    +17386 2607:fa98:40:9:26b6:fdff:feff:1c96
    +21566 2607:fa98:40:9:26b6:fdff:feff:195d
    +45384 2607:fa98:40:9:26b6:fdff:feff:1888
    +
  • - +
  • The user agent for the top six or so IPs is the same:

    "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36"
    -
    +
  • - +
  • whois says they come from Perfect IP

  • + +
  • I’ve never seen those top IPs before, but they have created 50,000 Tomcat sessions today:

    $ grep -E '(2607:fa98:40:9:26b6:fdff:feff:1888|2607:fa98:40:9:26b6:fdff:feff:195d|2607:fa98:40:9:26b6:fdff:feff:1c96|70.36.107.49|70.36.107.190|70.36.107.50)' /home/cgspace.cgiar.org/log/dspace.log.2018-01-10 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l                                                                                                                                                                                                  
     49096
    -
    +
  • - +
  • Rather than blocking their IPs, I think I might just add their user agent to the “badbots” zone with Baidu, because they seem to be the only ones using that user agent:

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari
     /537.36" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    -   6796 70.36.107.50
    -  11870 70.36.107.190
    -  17323 70.36.107.49
    -  19204 2607:fa98:40:9:26b6:fdff:feff:1c96
    -  23401 2607:fa98:40:9:26b6:fdff:feff:195d 
    -  47875 2607:fa98:40:9:26b6:fdff:feff:1888
    -
    +6796 70.36.107.50
    +11870 70.36.107.190
    +17323 70.36.107.49
    +19204 2607:fa98:40:9:26b6:fdff:feff:1c96
    +23401 2607:fa98:40:9:26b6:fdff:feff:195d
    +47875 2607:fa98:40:9:26b6:fdff:feff:1888
    +
  • - +
  • I added the user agent to nginx’s badbots limit req zone but upon testing the config I got an error:

    # nginx -t
     nginx: [emerg] could not build map_hash, you should increase map_hash_bucket_size: 64
     nginx: configuration file /etc/nginx/nginx.conf test failed
    -
    +
  • - +
  • According to the nginx docs the bucket size should be a multiple of the CPU’s cache alignment, which is 64 for us (a possible fix is sketched after this output):

    # cat /proc/cpuinfo | grep cache_alignment | head -n1
     cache_alignment : 64
    -
    +
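
  • Presumably the fix is just to raise that directive to a larger multiple of the cache alignment in the http block and re-test (a sketch; 128 is an arbitrary choice and the stock nginx.conf path is an assumption):

    # grep map_hash_bucket_size /etc/nginx/nginx.conf
    map_hash_bucket_size 128;
    # nginx -t && systemctl reload nginx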
  • -

    2018-01-11

    @@ -747,8 +723,8 @@ cache_alignment : 64 + +
  • Following up with the Solr sharding issue on the dspace-tech mailing list, I noticed this interesting snippet in the Tomcat localhost_access_log at the time of my sharding attempt on my test machine:

    127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] "GET /solr/statistics/select?q=type%3A2+AND+id%3A1&wt=javabin&version=2 HTTP/1.1" 200 107
     127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] "GET /solr/statistics/select?q=*%3A*&rows=0&facet=true&facet.range=time&facet.range.start=NOW%2FYEAR-18YEARS&facet.range.end=NOW%2FYEAR%2B0YEARS&facet.range.gap=%2B1YEAR&facet.mincount=1&wt=javabin&version=2 HTTP/1.1" 200 447
    @@ -757,137 +733,140 @@ cache_alignment : 64
     127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] "GET /solr/statistics/select?csv.mv.separator=%7C&q=*%3A*&fq=time%3A%28%5B2016%5C-01%5C-01T00%5C%3A00%5C%3A00Z+TO+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%5D+NOT+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%29&rows=10000&wt=csv HTTP/1.1" 200 2137630
     127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] "GET /solr/statistics/admin/luke?show=schema&wt=javabin&version=2 HTTP/1.1" 200 16253
     127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] "POST /solr//statistics-2016/update/csv?commit=true&softCommit=false&waitSearcher=true&f.previousWorkflowStep.split=true&f.previousWorkflowStep.separator=%7C&f.previousWorkflowStep.encapsulator=%22&f.actingGroupId.split=true&f.actingGroupId.separator=%7C&f.actingGroupId.encapsulator=%22&f.containerCommunity.split=true&f.containerCommunity.separator=%7C&f.containerCommunity.encapsulator=%22&f.range.split=true&f.range.separator=%7C&f.range.encapsulator=%22&f.containerItem.split=true&f.containerItem.separator=%7C&f.containerItem.encapsulator=%22&f.p_communities_map.split=true&f.p_communities_map.separator=%7C&f.p_communities_map.encapsulator=%22&f.ngram_query_search.split=true&f.ngram_query_search.separator=%7C&f.ngram_query_search.encapsulator=%22&f.containerBitstream.split=true&f.containerBitstream.separator=%7C&f.containerBitstream.encapsulator=%22&f.owningItem.split=true&f.owningItem.separator=%7C&f.owningItem.encapsulator=%22&f.actingGroupParentId.split=true&f.actingGroupParentId.separator=%7C&f.actingGroupParentId.encapsulator=%22&f.text.split=true&f.text.separator=%7C&f.text.encapsulator=%22&f.simple_query_search.split=true&f.simple_query_search.separator=%7C&f.simple_query_search.encapsulator=%22&f.owningComm.split=true&f.owningComm.separator=%7C&f.owningComm.encapsulator=%22&f.owner.split=true&f.owner.separator=%7C&f.owner.encapsulator=%22&f.filterquery.split=true&f.filterquery.separator=%7C&f.filterquery.encapsulator=%22&f.p_group_map.split=true&f.p_group_map.separator=%7C&f.p_group_map.encapsulator=%22&f.actorMemberGroupId.split=true&f.actorMemberGroupId.separator=%7C&f.actorMemberGroupId.encapsulator=%22&f.bitstreamId.split=true&f.bitstreamId.separator=%7C&f.bitstreamId.encapsulator=%22&f.group_name.split=true&f.group_name.separator=%7C&f.group_name.encapsulator=%22&f.p_communities_name.split=true&f.p_communities_name.separator=%7C&f.p_communities_name.encapsulator=%22&f.query.split=true&f.query.separator=%7C&f.query.encapsulator=%22&f.workflowStep.split=true&f.workflowStep.separator=%7C&f.workflowStep.encapsulator=%22&f.containerCollection.split=true&f.containerCollection.separator=%7C&f.containerCollection.encapsulator=%22&f.complete_query_search.split=true&f.complete_query_search.separator=%7C&f.complete_query_search.encapsulator=%22&f.p_communities_id.split=true&f.p_communities_id.separator=%7C&f.p_communities_id.encapsulator=%22&f.rangeDescription.split=true&f.rangeDescription.separator=%7C&f.rangeDescription.encapsulator=%22&f.group_id.split=true&f.group_id.separator=%7C&f.group_id.encapsulator=%22&f.bundleName.split=true&f.bundleName.separator=%7C&f.bundleName.encapsulator=%22&f.ngram_simplequery_search.split=true&f.ngram_simplequery_search.separator=%7C&f.ngram_simplequery_search.encapsulator=%22&f.group_map.split=true&f.group_map.separator=%7C&f.group_map.encapsulator=%22&f.owningColl.split=true&f.owningColl.separator=%7C&f.owningColl.encapsulator=%22&f.p_group_id.split=true&f.p_group_id.separator=%7C&f.p_group_id.encapsulator=%22&f.p_group_name.split=true&f.p_group_name.separator=%7C&f.p_group_name.encapsulator=%22&wt=javabin&version=2 HTTP/1.1" 409 156
    -
    +
  • - +
  • The new core is created but when DSpace attempts to POST to it there is an HTTP 409 error

  • + +
  • This is apparently a common Solr error code that means “version conflict”: http://yonik.com/solr/optimistic-concurrency/

  • + +
  • Looks like that bot from the PerfectIP.net host ended up making about 450,000 requests to XMLUI alone yesterday:

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36" | grep "10/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    -  21572 70.36.107.50
    -  30722 70.36.107.190
    -  34566 70.36.107.49
    - 101829 2607:fa98:40:9:26b6:fdff:feff:195d
    - 111535 2607:fa98:40:9:26b6:fdff:feff:1c96
    - 161797 2607:fa98:40:9:26b6:fdff:feff:1888
    -
    +21572 70.36.107.50
    +30722 70.36.107.190
    +34566 70.36.107.49
    +101829 2607:fa98:40:9:26b6:fdff:feff:195d
    +111535 2607:fa98:40:9:26b6:fdff:feff:1c96
    +161797 2607:fa98:40:9:26b6:fdff:feff:1888
    +
  • - +
  • Wow, I just figured out how to set the application name of each database pool in the JNDI config of Tomcat’s server.xml:

    <Resource name="jdbc/dspaceWeb" auth="Container" type="javax.sql.DataSource"
    -          driverClassName="org.postgresql.Driver"
    -          url="jdbc:postgresql://localhost:5432/dspacetest?ApplicationName=dspaceWeb"
    -          username="dspace"
    -          password="dspace"
    -          initialSize='5'
    -          maxActive='75'
    -          maxIdle='15'
    -          minIdle='5'
    -          maxWait='5000'
    -          validationQuery='SELECT 1'
    -          testOnBorrow='true' />
    -
    + driverClassName="org.postgresql.Driver"
    + url="jdbc:postgresql://localhost:5432/dspacetest?ApplicationName=dspaceWeb"
    + username="dspace"
    + password="dspace"
    + initialSize='5'
    + maxActive='75'
    + maxIdle='15'
    + minIdle='5'
    + maxWait='5000'
    + validationQuery='SELECT 1'
    + testOnBorrow='true' />
    +
  • - +
  • So theoretically I could name each connection “xmlui” or “dspaceWeb” or something meaningful and it would show up in PostgreSQL’s pg_stat_activity table (see the query sketched below)!

  • + +
  • This would be super helpful for figuring out where load was coming from (now I wonder if I could figure out how to graph this)

  • + +
  • Also, I realized that the db.jndi parameter in dspace.cfg needs to match the name value in your application’s context, not the global one

  • + +
  • Ah hah! Also, I can name the default DSpace connection pool in dspace.cfg as well, like:

    db.url = jdbc:postgresql://localhost:5432/dspacetest?ApplicationName=dspaceDefault
    -
    +
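
  • After restarting Tomcat, something like this should show the connections grouped by their new names (a sketch; application_name is a standard pg_stat_activity column and psql is assumed to connect to the DSpace database):

    $ psql -c 'SELECT application_name, count(*) FROM pg_stat_activity GROUP BY application_name;'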
  • -

    2018-01-12

    +
  • I’m looking at the DSpace 6.0 Install docs and notice they tweak the number of threads in their Tomcat connector:

    <!-- Define a non-SSL HTTP/1.1 Connector on port 8080 -->
     <Connector port="8080"
    -           maxThreads="150"
    -           minSpareThreads="25"
    -           maxSpareThreads="75"
    -           enableLookups="false"
    -           redirectPort="8443"
    -           acceptCount="100"
    -           connectionTimeout="20000"
    -           disableUploadTimeout="true"
    -           URIEncoding="UTF-8"/>
    -
    + maxThreads="150"
    + minSpareThreads="25"
    + maxSpareThreads="75"
    + enableLookups="false"
    + redirectPort="8443"
    + acceptCount="100"
    + connectionTimeout="20000"
    + disableUploadTimeout="true"
    + URIEncoding="UTF-8"/>
    +
  • - +
  • In Tomcat 8.5 the maxThreads defaults to 200 which is probably fine, but tweaking minSpareThreads could be good

  • + +
  • I don’t see a setting for maxSpareThreads in the docs so that might be an error

  • + +
  • Looks like in Tomcat 8.5 the default URIEncoding for Connectors is UTF-8, so we don’t need to specify that manually anymore: https://tomcat.apache.org/tomcat-8.5-doc/config/http.html

  • + +
  • Ooh, I just saw the acceptorThreadCount setting (in Tomcat 7 and 8.5):

    The number of threads to be used to accept connections. Increase this value on a multi CPU machine, although you would never really need more than 2. Also, with a lot of non keep alive connections, you might want to increase this value as well. Default value is 1.
    -
    +
  • -

    2018-01-13

    + +
  • Catalina errors at Tomcat 8.5 startup:

    13-Jan-2018 13:59:05.245 WARNING [main] org.apache.tomcat.dbcp.dbcp2.BasicDataSourceFactory.getObjectInstance Name = dspace6 Property maxActive is not used in DBCP2, use maxTotal instead. maxTotal default value is 8. You have set value of "35" for "maxActive" property, which is being ignored.
     13-Jan-2018 13:59:05.245 WARNING [main] org.apache.tomcat.dbcp.dbcp2.BasicDataSourceFactory.getObjectInstance Name = dspace6 Property maxWait is not used in DBCP2 , use maxWaitMillis instead. maxWaitMillis default value is -1. You have set value of "5000" for "maxWait" property, which is being ignored.
    -
    +
  • - +
  • I looked in my Tomcat 7.0.82 logs and I don’t see anything about DBCP2 errors, so I guess this is a Tomcat 8.0.x or 8.5.x thing

  • + +
  • DBCP2 appears to be Tomcat 8.0.x and up according to the Tomcat 8.0 migration guide

  • + +
  • I have updated our Ansible infrastructure scripts so that it will be ready whenever we switch to Tomcat 8 (probably with Ubuntu 18.04 later this year)

  • + +
  • When I enable the ResourceLink in the ROOT.xml context I get the following error in the Tomcat localhost log:

    13-Jan-2018 14:14:36.017 SEVERE [localhost-startStop-1] org.apache.catalina.core.StandardContext.listenerStart Exception sending context initialized event to listener instance of class [org.dspace.app.util.DSpaceWebappListener]
    - java.lang.ExceptionInInitializerError
    -        at org.dspace.app.util.AbstractDSpaceWebapp.register(AbstractDSpaceWebapp.java:74)
    -        at org.dspace.app.util.DSpaceWebappListener.contextInitialized(DSpaceWebappListener.java:31)
    -        at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4745)
    -        at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5207)
    -        at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
    -        at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:752)
    -        at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:728)
    -        at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:734)
    -        at org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:629)
    -        at org.apache.catalina.startup.HostConfig$DeployDescriptor.run(HostConfig.java:1839)
    -        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    -        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    -        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    -        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    -        at java.lang.Thread.run(Thread.java:748)
    +java.lang.ExceptionInInitializerError
    +    at org.dspace.app.util.AbstractDSpaceWebapp.register(AbstractDSpaceWebapp.java:74)
    +    at org.dspace.app.util.DSpaceWebappListener.contextInitialized(DSpaceWebappListener.java:31)
    +    at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4745)
    +    at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5207)
    +    at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
    +    at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:752)
    +    at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:728)
    +    at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:734)
    +    at org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:629)
    +    at org.apache.catalina.startup.HostConfig$DeployDescriptor.run(HostConfig.java:1839)
    +    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    +    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    +    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    +    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    +    at java.lang.Thread.run(Thread.java:748)
     Caused by: java.lang.NullPointerException
    -        at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:547)
    -        at org.dspace.core.Context.<clinit>(Context.java:103)
    -        ... 15 more
    -
    +    at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:547)
    +    at org.dspace.core.Context.<clinit>(Context.java:103)
    +    ... 15 more
    +
  • -

    2018-01-14

    @@ -903,8 +882,8 @@ Caused by: java.lang.NullPointerException + +
  • I’m going to apply these ~130 corrections on CGSpace:

    update metadatavalue set text_value='Formally Published' where resource_type_id=2 and metadata_field_id=214 and text_value like 'Formally published';
     delete from metadatavalue where resource_type_id=2 and metadata_field_id=214 and text_value like 'NO';
    @@ -915,93 +894,84 @@ update metadatavalue set text_value='vi' where resource_type_id=2 and metadata_f
     update metadatavalue set text_value='ru' where resource_type_id=2 and metadata_field_id=38 and text_value='Ru';
     update metadatavalue set text_value='in' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(IN|In)';
     delete from metadatavalue where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(dc.language.iso|CGIAR Challenge Program on Water and Food)';
    -
    +
  • -

    OpenRefine Authors

    +
  • Apply corrections using fix-metadata-values.py:

    $ ./fix-metadata-values.py -i /tmp/2018-01-14-Authors-1300-Corrections.csv -f dc.contributor.author -t correct -m 3 -d dspace -u dspace -p 'fuuu'
    -
    +
  • - +
  • In looking at some of the values to delete or check, I found some metadata values whose handles I could not resolve via SQL:

    dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='Tarawali';
    - metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place | authority | confidence | resource_type_id
    +metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place | authority | confidence | resource_type_id
     -------------------+-------------+-------------------+------------+-----------+-------+-----------+------------+------------------
    -           2757936 |        4369 |                 3 | Tarawali   |           |     9 |           |        600 |                2
    +       2757936 |        4369 |                 3 | Tarawali   |           |     9 |           |        600 |                2
     (1 row)
     
     dspace=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '4369';
    - handle
    +handle
     --------
     (0 rows)
    -
    +
  • - +
  • Even searching in the DSpace advanced search for author equals “Tarawali” produces nothing…

  • + +
  • Otherwise, the DSpace 5 SQL Helper Functions provide ds5_item2itemhandle(), which is much easier than my long query above that I always have to go search for

  • + +
  • For example, to find the Handle for an item that has the author “Erni”:

    dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='Erni';
    - metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place |              authority               | confidence | resource_type_id 
    +metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place |              authority               | confidence | resource_type_id 
     -------------------+-------------+-------------------+------------+-----------+-------+--------------------------------------+------------+------------------
    -           2612150 |       70308 |                 3 | Erni       |           |     9 | 3fe10c68-6773-49a7-89cc-63eb508723f2 |         -1 |                2
    +       2612150 |       70308 |                 3 | Erni       |           |     9 | 3fe10c68-6773-49a7-89cc-63eb508723f2 |         -1 |                2
     (1 row)
     dspace=# select ds5_item2itemhandle(70308);
    - ds5_item2itemhandle 
    +ds5_item2itemhandle 
     ---------------------
    - 10568/68609
    +10568/68609
     (1 row)
    -
    +
  • - +
  • Next I apply the author deletions:

    $ ./delete-metadata-values.py -i /tmp/2018-01-14-Authors-5-Deletions.csv -f dc.contributor.author -m 3 -d dspace -u dspace -p 'fuuu'
    -
    +
  • - +
  • Now working on the affiliation corrections from Peter:

    $ ./fix-metadata-values.py -i /tmp/2018-01-15-Affiliations-888-Corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p 'fuuu'
     $ ./delete-metadata-values.py -i /tmp/2018-01-15-Affiliations-11-Deletions.csv -f cg.contributor.affiliation -m 211 -d dspace -u dspace -p 'fuuu'
    -
    +
  • - +
  • Now I made a new list of affiliations for Peter to look through:

    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where metadata_schema_id = 2 and element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
     COPY 4552
    -
    +
  • - +
  • Looking over the affiliations again I see dozens of CIAT ones with their affiliation formatted like: International Center for Tropical Agriculture (CIAT)

  • + +
  • For example, this one is from just last month: https://cgspace.cgiar.org/handle/10568/89930

  • + +
  • Our controlled vocabulary has this in the format without the abbreviation: International Center for Tropical Agriculture

  • + +
  • So some submitters don’t know to use the controlled vocabulary lookup
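
  • A rough way to count how many of those non-standard values exist (a sketch; metadata_field_id 211 is cg.contributor.affiliation, as in the fix-metadata-values.py runs above):

    dspace=# select count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=211 and text_value='International Center for Tropical Agriculture (CIAT)';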

  • + +
  • Help Sisay with some thumbnails for book chapters in Open Refine and SAFBuilder

  • + +
  • CGSpace users were having problems logging in, I think something’s wrong with LDAP because I see this in the logs:

    2018-01-15 12:53:15,810 WARN  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=2386749547D03E0AA4EC7E44181A7552:ip_addr=x.x.x.x:ldap_authentication:type=failed_auth javax.naming.AuthenticationException\colon; [LDAP\colon; error code 49 - 80090308\colon; LdapErr\colon; DSID-0C090400, comment\colon; AcceptSecurityContext error, data 775, v1db1^@]
    -
    +
  • - +
  • Looks like we processed 2.9 million requests on CGSpace in 2017-12:

    # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Dec/2017"
     2890041
    @@ -1009,7 +979,8 @@ COPY 4552
     real    0m25.756s
     user    0m28.016s
     sys     0m2.210s
    -
    +
  • +

    2018-01-16

    @@ -1038,143 +1009,138 @@ sys 0m2.210s
  • Abenet asked me to proof and upload 54 records for LIVES
  • A few records were missing countries (even though they’re all from Ethiopia)
  • Also, there are whitespace issues in many columns, and the items are mapped to the LIVES and ILRI articles collections, not Theses
  • In any case, importing them like this:

    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
     $ dspace import -a -e aorth@mjanja.ch -s /tmp/2018-01-16\ LIVES/SimpleArchiveFormat -m lives.map &> lives.log
    -
    +
  • - +
  • And fantastic, before I started the import there were 10 PostgreSQL connections, and then CGSpace crashed during the upload

  • + +
  • When I looked there were 210 PostgreSQL connections!

  • + +
  • I don’t see any high load in XMLUI or REST/OAI:

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "17/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    -    381 40.77.167.124
    -    403 213.55.99.121
    -    431 207.46.13.60
    -    445 157.55.39.113
    -    445 157.55.39.231
    -    449 95.108.181.88
    -    453 68.180.229.254
    -    593 54.91.48.104
    -    757 104.196.152.243
    -    776 66.249.66.90
    +381 40.77.167.124
    +403 213.55.99.121
    +431 207.46.13.60
    +445 157.55.39.113
    +445 157.55.39.231
    +449 95.108.181.88
    +453 68.180.229.254
    +593 54.91.48.104
    +757 104.196.152.243
    +776 66.249.66.90
     # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "17/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    -     11 205.201.132.14
    -     11 40.77.167.124
    -     15 35.226.23.240
    -     16 157.55.39.231
    -     16 66.249.64.155
    -     18 66.249.66.90
    -     22 95.108.181.88
    -     58 104.196.152.243
    -   4106 70.32.83.92
    -   9229 45.5.184.196
    -
    + 11 205.201.132.14
    + 11 40.77.167.124
    + 15 35.226.23.240
    + 16 157.55.39.231
    + 16 66.249.64.155
    + 18 66.249.66.90
    + 22 95.108.181.88
    + 58 104.196.152.243
    +4106 70.32.83.92
    +9229 45.5.184.196
    +
  • - +
  • But I do see this strange message in the dspace log:

    2018-01-17 07:59:25,856 INFO  org.apache.http.impl.client.SystemDefaultHttpClient @ I/O exception (org.apache.http.NoHttpResponseException) caught when processing request to {}->http://localhost:8081: The target server failed to respond
     2018-01-17 07:59:25,856 INFO  org.apache.http.impl.client.SystemDefaultHttpClient @ Retrying request to {}->http://localhost:8081
    -
    +
  • - +
  • I have NEVER seen this error before, and there is no error before or after that in DSpace’s solr.log

  • + +
  • Tomcat’s catalina.out does show something interesting, though, right at that time:

    [====================>                              ]40% time remaining: 7 hour(s) 14 minute(s) 45 seconds. timestamp: 2018-01-17 07:57:02
     [====================>                              ]40% time remaining: 7 hour(s) 14 minute(s) 45 seconds. timestamp: 2018-01-17 07:57:11
     [====================>                              ]40% time remaining: 7 hour(s) 14 minute(s) 44 seconds. timestamp: 2018-01-17 07:57:37
     [====================>                              ]40% time remaining: 7 hour(s) 16 minute(s) 5 seconds. timestamp: 2018-01-17 07:57:49
     Exception in thread "http-bio-127.0.0.1-8081-exec-627" java.lang.OutOfMemoryError: Java heap space
    -        at org.apache.lucene.util.FixedBitSet.clone(FixedBitSet.java:576)
    -        at org.apache.solr.search.BitDocSet.andNot(BitDocSet.java:222)
    -        at org.apache.solr.search.SolrIndexSearcher.getProcessedFilter(SolrIndexSearcher.java:1067)
    -        at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1557)
    -        at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1433)
    -        at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:514)
    -        at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:485)
    -        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:218)
    -        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    -        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
    -        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
    -        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
    -        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
    -        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
    -        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
    -        at org.dspace.solr.filters.LocalHostRestrictionFilter.doFilter(LocalHostRestrictionFilter.java:50)
    -        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
    -        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
    -        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:221)
    -        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
    -        at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:505)
    -        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:169)
    -        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
    -        at org.apache.catalina.valves.CrawlerSessionManagerValve.invoke(CrawlerSessionManagerValve.java:180)
    -        at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:956)
    -        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
    -        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:436)
    -        at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1078)
    -        at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:625)
    -        at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:318) 
    -        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    -        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    -
    +    at org.apache.lucene.util.FixedBitSet.clone(FixedBitSet.java:576)
    +    at org.apache.solr.search.BitDocSet.andNot(BitDocSet.java:222)
    +    at org.apache.solr.search.SolrIndexSearcher.getProcessedFilter(SolrIndexSearcher.java:1067)
    +    at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1557)
    +    at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1433)
    +    at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:514)
    +    at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:485)
    +    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:218)
    +    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    +    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
    +    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
    +    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
    +    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
    +    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
    +    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
    +    at org.dspace.solr.filters.LocalHostRestrictionFilter.doFilter(LocalHostRestrictionFilter.java:50)
    +    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
    +    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
    +    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:221)
    +    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
    +    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:505)
    +    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:169)
    +    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
    +    at org.apache.catalina.valves.CrawlerSessionManagerValve.invoke(CrawlerSessionManagerValve.java:180)
    +    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:956)
    +    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
    +    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:436)
    +    at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1078)
    +    at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:625)
    +    at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:318)
    +    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    +    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    +
  • - +
  • You can see the timestamp above, which is some Atmire nightly task I think, but I can’t figure out which one

  • + +
  • So I restarted Tomcat and tried the import again, which finished very quickly and without errors!

    $ dspace import -a -e aorth@mjanja.ch -s /tmp/2018-01-16\ LIVES/SimpleArchiveFormat -m lives2.map &> lives2.log
    -
    +
  • -

    Tomcat JVM Heap

    +
  • I’m playing with maven repository caching using Artifactory in a Docker instance: https://www.jfrog.com/confluence/display/RTF/Installing+with+Docker

    $ docker pull docker.bintray.io/jfrog/artifactory-oss:latest
     $ docker volume create --name artifactory5_data
     $ docker network create dspace-build
     $ docker run --network dspace-build --name artifactory -d -v artifactory5_data:/var/opt/jfrog/artifactory -p 8081:8081 docker.bintray.io/jfrog/artifactory-oss:latest
    -
    +
  • - +
  • Then configure the local maven to use it in settings.xml with the settings from “Set Me Up”: https://www.jfrog.com/confluence/display/RTF/Using+Artifactory

  • + +
  • This could be a game changer for testing and running the Docker DSpace image

  • + +
  • Wow, I even managed to add the Atmire repository as a remote and map it into the libs-release virtual repository, then tell maven to use it for atmire.com-releases in settings.xml!

  • + +
  • Hmm, some maven dependencies for the SWORDv2 web application in DSpace 5.5 are broken:

    [ERROR] Failed to execute goal on project dspace-swordv2: Could not resolve dependencies for project org.dspace:dspace-swordv2:war:5.5: Failed to collect dependencies at org.swordapp:sword2-server:jar:classes:1.0 -> org.apache.abdera:abdera-client:jar:1.1.1 -> org.apache.abdera:abdera-core:jar:1.1.1 -> org.apache.abdera:abdera-i18n:jar:1.1.1 -> org.apache.geronimo.specs:geronimo-activation_1.0.2_spec:jar:1.1: Failed to read artifact descriptor for org.apache.geronimo.specs:geronimo-activation_1.0.2_spec:jar:1.1: Could not find artifact org.apache.geronimo.specs:specs:pom:1.1 in central (http://localhost:8081/artifactory/libs-release) -> [Help 1]
    -
    +
  • - +
  • I never noticed because I build with that web application disabled:

    $ mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false -Denv=localhost -P \!dspace-sword,\!dspace-swordv2 clean package
    -
    +
  • - +
  • UptimeRobot said CGSpace went down for a few minutes

  • + +
  • I didn’t do anything but it came back up on its own

  • + +
  • I don’t see anything unusual in the XMLUI or REST/OAI logs

  • + +
  • Now Linode alert says the CPU load is high, sigh

  • + +
  • Regarding the heap space error earlier today, it looks like it does happen a few times a week or month (I’m not sure how far these logs go back, as they are not strictly daily):

    # zgrep -c java.lang.OutOfMemoryError /var/log/tomcat7/catalina.out* | grep -v :0
     /var/log/tomcat7/catalina.out:2
    @@ -1197,11 +1163,11 @@ $ docker run --network dspace-build --name artifactory -d -v artifactory5_data:/
     /var/log/tomcat7/catalina.out.4.gz:3
     /var/log/tomcat7/catalina.out.6.gz:2
     /var/log/tomcat7/catalina.out.7.gz:14
    -
    +
  • -

    2018-01-18

    @@ -1209,94 +1175,93 @@ $ docker run --network dspace-build --name artifactory -d -v artifactory5_data:/ + +
  • I realize I never did a full re-index after the SQL author and affiliation updates last week, so I should force one now:

    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
     $ time schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace index-discovery -b
    -
    +
  • - +
  • Maria from Bioversity asked if I could remove the abstracts from all of their Limited Access items in the Bioversity Journal Articles collection

  • + +
  • It’s easy enough to do in OpenRefine, but you have to be careful to only get those items that are uploaded into Bioversity’s collection, not the ones that are mapped from others!

  • + +
  • Use this GREL in OpenRefine after isolating all the Limited Access items: value.startsWith("10568/35501")

  • + +
  • UptimeRobot said CGSpace went down AGAIN and both Sisay and Danny immediately logged in and restarted Tomcat without talking to me or each other!

    Jan 18 07:01:22 linode18 sudo[10805]: dhmichael : TTY=pts/5 ; PWD=/home/dhmichael ; USER=root ; COMMAND=/bin/systemctl restart tomcat7
     Jan 18 07:01:22 linode18 sudo[10805]: pam_unix(sudo:session): session opened for user root by dhmichael(uid=0)
     Jan 18 07:01:22 linode18 systemd[1]: Stopping LSB: Start Tomcat....
     Jan 18 07:01:22 linode18 sudo[10812]: swebshet : TTY=pts/3 ; PWD=/home/swebshet ; USER=root ; COMMAND=/bin/systemctl restart tomcat7
     Jan 18 07:01:22 linode18 sudo[10812]: pam_unix(sudo:session): session opened for user root by swebshet(uid=0)
    -
    +
  • -

    2018-01-19

    + +
  • Start the Discovery indexing again:

    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
     $ time schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace index-discovery -b
    -
    +
  • - +
  • Linode alerted again and said that CGSpace was using 301% CPU

  • + +
  • Peter emailed to ask why this item doesn’t have an Altmetric badge on CGSpace but does have one on the Altmetric dashboard

  • + +
  • Looks like our badge code calls the handle endpoint, which doesn’t exist (a quick check is sketched below):

    https://api.altmetric.com/v1/handle/10568/88090
    -
    +
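
  • A quick way to confirm that from the command line (a sketch, using the same HTTPie client as the other checks in these notes; I would expect an error response here):

    $ http --headers 'https://api.altmetric.com/v1/handle/10568/88090'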
  • -

    2018-01-20

    +
  • Run the authority indexing script on CGSpace and of course it died:

    $ time schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace index-authority 
     Retrieving all data 
     Initialize org.dspace.authority.indexer.DSpaceAuthorityIndexer 
     Exception: null
     java.lang.NullPointerException
    -        at org.dspace.authority.AuthorityValueGenerator.generateRaw(AuthorityValueGenerator.java:82)
    -        at org.dspace.authority.AuthorityValueGenerator.generate(AuthorityValueGenerator.java:39)
    -        at org.dspace.authority.indexer.DSpaceAuthorityIndexer.prepareNextValue(DSpaceAuthorityIndexer.java:201)
    -        at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:132)
    -        at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
    -        at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
    -        at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:159)
    -        at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
    -        at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
    -        at org.dspace.authority.indexer.AuthorityIndexClient.main(AuthorityIndexClient.java:61)
    -        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    -        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    -        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    -        at java.lang.reflect.Method.invoke(Method.java:498)
    -        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
    -        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
    +    at org.dspace.authority.AuthorityValueGenerator.generateRaw(AuthorityValueGenerator.java:82)
    +    at org.dspace.authority.AuthorityValueGenerator.generate(AuthorityValueGenerator.java:39)
    +    at org.dspace.authority.indexer.DSpaceAuthorityIndexer.prepareNextValue(DSpaceAuthorityIndexer.java:201)
    +    at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:132)
    +    at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
    +    at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
    +    at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:159)
    +    at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
    +    at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
    +    at org.dspace.authority.indexer.AuthorityIndexClient.main(AuthorityIndexClient.java:61)
    +    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    +    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    +    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    +    at java.lang.reflect.Method.invoke(Method.java:498)
    +    at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
    +    at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
      
     real    7m2.241s
     user    1m33.198s
     sys     0m12.317s
    -
    +
  • - +
  • I tested the abstract cleanups that I had started a few days ago on Bioversity’s Journal Articles collection

  • + +
  • In the end there were 324 items in the collection that were Limited Access, but only 199 had abstracts

  • + +
  • I want to document the workflow of adding a production PostgreSQL database to a development instance of DSpace in Docker:

    $ docker exec dspace_db dropdb -U postgres dspace
     $ docker exec dspace_db createdb -U postgres -O dspace --encoding=UNICODE dspace
    @@ -1307,7 +1272,8 @@ $ docker exec dspace_db psql -U postgres dspace -c 'alter user dspace nocreateus
     $ docker exec dspace_db vacuumdb -U postgres dspace
     $ docker cp ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspace_db:/tmp
     $ docker exec dspace_db psql -U dspace -f /tmp/update-sequences.sql dspace
    -
    +
  • +

    2018-01-22

    @@ -1327,102 +1293,99 @@ $ docker exec dspace_db psql -U dspace -f /tmp/update-sequences.sql dspace
  • I wrote a quick Python script to use the DSpace REST API to find all collections under a given community
  • The source code is here: rest-find-collections.py
  • Peter had said that he found a bunch of ILRI collections that were called “untitled”, but I don’t see any:

    $ ./rest-find-collections.py 10568/1 | wc -l
     308
     $ ./rest-find-collections.py 10568/1 | grep -i untitled
    -
    +
  • -

    2018-01-23

    + +
  • I got a list of all the GET requests on CGSpace for January 21st (the last time Linode complained the load was high), excluding admin calls:

    # zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep "21/Jan/2018" | grep "GET " | grep -c -v "/admin"
     56405
    -
    +
  • - +
  • Apparently about 26% of these requests were for bitstreams, 27% for the REST API, and 27% for handles:

    # zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep "21/Jan/2018" | grep "GET " | grep -v "/admin" | awk '{print $7}' | grep -Eo "^/(handle|bitstream|rest|oai)/" | sort | uniq -c | sort -n
    -     38 /oai/
    -  14406 /bitstream/
    -  15179 /rest/
    -  15191 /handle/
    -
    + 38 /oai/
    +14406 /bitstream/
    +15179 /rest/
    +15191 /handle/
    +
  • - +
  • And 3% were to the homepage or search:

    # zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep "21/Jan/2018" | grep "GET " | grep -v "/admin" | awk '{print $7}' | grep -Eo '^/($|open-search|discover)' | sort | uniq -c
    -   1050 /
    -    413 /discover
    -    170 /open-search
    -
    +1050 /
    +413 /discover
    +170 /open-search
    +
  • - +
  • The last 10% or so seem to be for static assets that would be served by nginx anyways:

    # zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep "21/Jan/2018" | grep "GET " | grep -v "/admin" | awk '{print $7}' | grep -v bitstream | grep -Eo '\.(js|css|png|jpg|jpeg|php|svg|gif|txt|map)$' | sort | uniq -c | sort -n
    -      2 .gif
    -      7 .css
    -     84 .js
    -    433 .php
    -    882 .txt
    -   2551 .png
    -
    + 2 .gif
    + 7 .css
    + 84 .js
    +433 .php
    +882 .txt
    +2551 .png
    +
  • -

    2018-01-24

    +
  • Looking at the REST requests, most of them are to expand all or metadata, but 5% are for retrieving bitstreams:

    # zcat --force /var/log/nginx/access.log.3.gz /var/log/nginx/access.log.4.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/library-access.log.4.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/rest.log.4.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/oai.log.4.gz /var/log/nginx/error.log.3.gz /var/log/nginx/error.log.4.gz | grep "21/Jan/2018" | grep "GET " | grep -v "/admin" | awk '{print $7}' | grep -E "^/rest" | grep -Eo "(retrieve|expand=[a-z].*)" | sort | uniq -c | sort -n
    -      1 expand=collections
    -     16 expand=all&limit=1
    -     45 expand=items
    -    775 retrieve
    -   5675 expand=all
    -   8633 expand=metadata
    -
    + 1 expand=collections
    + 16 expand=all&limit=1
    + 45 expand=items
    +775 retrieve
    +5675 expand=all
    +8633 expand=metadata
    +
  • - +
  • I finished creating the test plan for DSpace Test and ran it from my Linode with:

    $ jmeter -n -t DSpacePerfTest-dspacetest.cgiar.org.jmx -l 2018-01-24-1.jtl
    -
    +
  • - +
  • Atmire responded to my issue from two weeks ago and said they will start looking into DSpace 5.8 compatibility for CGSpace

  • + +
  • I set up a new Arch Linux Linode instance with 8192 MB of RAM and ran the test plan a few times to get a baseline:

    # lscpu
     # lscpu 
    @@ -1451,7 +1414,7 @@ L3 cache:            16384K
     NUMA node0 CPU(s):   0-3
     Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm cpuid_fault invpcid_single pti retpoline fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt arat
     # free -m
    -              total        used        free      shared  buff/cache   available
    +          total        used        free      shared  buff/cache   available
     Mem:           7970         107        7759           1         103        7771
     Swap:           255           0         255
     # pacman -Syu
    @@ -1465,38 +1428,34 @@ $ cd apache-jmeter-3.3/bin
     $ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-24-linode5451120-baseline.jtl -j ~/dspace-performance-test/2018-01-24-linode5451120-baseline.log
     $ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-24-linode5451120-baseline2.jtl -j ~/dspace-performance-test/2018-01-24-linode5451120-baseline2.log
     $ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-24-linode5451120-baseline3.jtl -j ~/dspace-performance-test/2018-01-24-linode5451120-baseline3.log
    -
    +
  • - +
  • Then I generated reports for these runs like this:

    $ jmeter -g 2018-01-24-linode5451120-baseline.jtl -o 2018-01-24-linode5451120-baseline
    -
    +
  • +

    2018-01-25

    +
  • Run another round of tests on DSpace Test with jmeter after changing Tomcat’s minSpareThreads to 20 (default is 10) and acceptorThreadCount to 2 (default is 1):

    $ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads.log
     $ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads2.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads2.log
     $ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads3.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads3.log
    -
    +
  • - +
  • I changed the parameters back to the baseline ones and switched the Tomcat JVM garbage collector to G1GC and re-ran the tests

  • + +
  • JVM options for Tomcat changed from -Xms3072m -Xmx3072m -XX:+UseConcMarkSweepGC to -Xms3072m -Xmx3072m -XX:+UseG1GC -XX:+PerfDisableSharedMem (where these are set is sketched after these test runs)

    $ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-g1gc.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-g1gc.log
     $ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-g1gc2.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-g1gc2.log
     $ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-g1gc3.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-g1gc3.log
    -
    +
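
  • For reference, on a stock Ubuntu Tomcat 7 package install those options are set via JAVA_OPTS in /etc/default/tomcat7 (a sketch; our Ansible templates may manage this elsewhere):

    JAVA_OPTS="-Xms3072m -Xmx3072m -XX:+UseG1GC -XX:+PerfDisableSharedMem"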
  • -

    2018-01-26

    @@ -1510,17 +1469,16 @@ $ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.j
  • I am testing my old work on the dc.rights field, I had added a branch for it a few months ago
  • I added a list of Creative Commons and other licenses in input-forms.xml
  • The problem is that Peter wanted to use two questions, one for CG centers and one for other, but using the same metadata value, which isn’t possible (?)
  • So I used some creativity and made several fields display values, but not store any, ie:

    <pair>
    -  <displayed-value>For products published by another party:</displayed-value>
    -  <stored-value></stored-value>
    +<displayed-value>For products published by another party:</displayed-value>
    +<stored-value></stored-value>
     </pair>
    -
    +
  • -

    Rights

    @@ -1544,47 +1502,46 @@ $ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.j + +
  • Looking at the DSpace logs I see this error happened just before UptimeRobot noticed it going down:

    2018-01-29 05:30:22,226 INFO  org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=3775D4125D28EF0C691B08345D905141:ip_addr=68.180.229.254:view_item:handle=10568/71890
     2018-01-29 05:30:22,322 ERROR org.dspace.app.xmlui.aspect.discovery.AbstractSearch @ org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1994+TO+1999]': Encountered " "]" "] "" at line 1, column 32.
     Was expecting one of:
    -    "TO" ...
    -    <RANGE_QUOTED> ...
    -    <RANGE_GOOP> ...
    +"TO" ...
    +<RANGE_QUOTED> ...
    +<RANGE_GOOP> ...
         
     org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1994+TO+1999]': Encountered " "]" "] "" at line 1, column 32.
     Was expecting one of:
    -    "TO" ...
    -    <RANGE_QUOTED> ...
    -    <RANGE_GOOP> ...
    -
    +"TO" ... +<RANGE_QUOTED> ... +<RANGE_GOOP> ... +
  • - +
  • So is this an error caused by this particular client (which happens to be Yahoo! Slurp)?

  • + +
  • I see a few dozen HTTP 499 errors in the nginx access log for a few minutes before this happened, but HTTP 499 is just when nginx says that the client closed the request early

  • + +
  • Perhaps this from the nginx error log is relevant?

    2018/01/29 05:26:34 [warn] 26895#26895: *944759 an upstream response is buffered to a temporary file /var/cache/nginx/proxy_temp/6/16/0000026166 while reading upstream, client: 180.76.15.34, server: cgspace.cgiar.org, request: "GET /bitstream/handle/10947/4658/FISH%20Leaflet.pdf?sequence=12 HTTP/1.1", upstream: "http://127.0.0.1:8443/bitstream/handle/10947/4658/FISH%20Leaflet.pdf?sequence=12", host: "cgspace.cgiar.org"
    -
    +
  • - +
  • I think that must be unrelated, probably the client closed the request to nginx because DSpace (Tomcat) was taking too long

  • + +
  • An interesting snippet to get the maximum and average nginx responses:

    # awk '($9 ~ /200/) { i++;sum+=$10;max=$10>max?$10:max; } END { printf("Maximum: %d\nAverage: %d\n",max,i?sum/i:0); }' /var/log/nginx/access.log
     Maximum: 2771268
     Average: 210483
    -
    +
  • - +
  • I guess responses that don’t fit in RAM get saved to disk (a default of 1024M), so this is definitely not the issue here, and that warning is totally unrelated

  • + +
  • My best guess is that the Solr search error is related somehow but I can’t figure it out

  • + +
  • We definitely have enough database connections, as I haven’t seen a pool error in weeks:

    $ grep -c "Timeout: Pool empty." dspace.log.2018-01-2*
     dspace.log.2018-01-20:0
    @@ -1597,14 +1554,17 @@ dspace.log.2018-01-26:0
     dspace.log.2018-01-27:0
     dspace.log.2018-01-28:0
     dspace.log.2018-01-29:0
    -
    +
  • - + +
  • Looks like I only enabled the new thread stuff on the connector used internally by Solr, so I probably need to match that by increasing them on the other connector that nginx proxies to

  • + +
  • Jesus Christ I need to fucking fix the Munin monitoring so that I can tell how many fucking threads I have running

  • + +
  • Wow, so apparently you need to specify which connector to check if you want any of the Munin Tomcat plugins besides “tomcat_jvm” to work (the connector name can be seen in the Catalina logs)

  • + +
  • I modified /etc/munin/plugin-conf.d/tomcat to add the connector (with surrounding quotes!) and now the other plugins work (obviously the credentials are incorrect):

    [tomcat_*]
    -    env.host 127.0.0.1
    -    env.port 8081
    -    env.connector "http-bio-127.0.0.1-8443"
    -    env.user munin
    -    env.password munin
    -
    +env.host 127.0.0.1
    +env.port 8081
    +env.connector "http-bio-127.0.0.1-8443"
    +env.user munin
    +env.password munin
    +
  • - +
  • For example, I can see the threads:

    # munin-run tomcat_threads
     busy.value 0
     idle.value 20
     max.value 400
    -
    +
  • - +
  • Apparently you can’t monitor more than one connector, so I guess the most important to monitor would be the one that nginx is sending stuff to

  • + +
  • So for now I think I’ll just monitor these and skip trying to configure the jmx plugins

  • + +
  • Although following the logic of /usr/share/munin/plugins/jmx_tomcat_dbpools could be useful for getting the active Tomcat sessions

  • + +
  • From debugging the jmx_tomcat_db_pools script from the munin-plugins-java package, I see that this is how you call arbitrary mbeans:

    # port=5400 ip="127.0.0.1" /usr/bin/java -cp /usr/share/munin/munin-jmx-plugins.jar org.munin.plugin.jmx.Beans Catalina:type=DataSource,class=javax.sql.DataSource,name=* maxActive
     Catalina:type=DataSource,class=javax.sql.DataSource,name="jdbc/dspace"  maxActive       300
    -
    +
  • - +
  • More notes here: https://github.com/munin-monitoring/contrib/tree/master/plugins/jmx

  • + +
  • Looking at the Munin graphs, I see that the load is 200% every morning from 03:00 to almost 08:00

  • + +
  • Tomcat’s catalina.out log file is full of spam from this thing too, with lines like this

    [===================>                               ]38% time remaining: 5 hour(s) 21 minute(s) 47 seconds. timestamp: 2018-01-29 06:25:16
    -
    +
  • - +
  • There are millions of these status lines, for example in just this one log file:

    # zgrep -c "time remaining" /var/log/tomcat7/catalina.out.1.gz
     1084741
    -
    +
  • -

    2018-01-31

    @@ -1676,63 +1635,57 @@ Catalina:type=DataSource,class=javax.sql.DataSource,name="jdbc/dspace"
  • Now PostgreSQL activity shows 265 database connections!
  • I don’t see any errors anywhere…
  • Now PostgreSQL activity shows 308 connections!
  • -
  • - + +
  • Well this is interesting, there are 400 Tomcat threads busy:

    # munin-run tomcat_threads
     busy.value 400
     idle.value 0
     max.value 400
    -
    +
  • - +
  • And wow, we finally exhausted the database connections, from dspace.log:

    2018-01-31 08:05:28,964 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error - 
     org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-451] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:300; busy:300; idle:0; lastwait:5000].
    -
    +
  • - +
  • Now even the nightly Atmire background thing is getting HTTP 500 error:

    Jan 31, 2018 8:16:05 AM com.sun.jersey.spi.container.ContainerResponse logException
     SEVERE: Mapped exception to response: 500 (Internal Server Error)
     javax.ws.rs.WebApplicationException
    -
    +
  • - +
  • For now I will restart Tomcat to clear this shit and bring the site back up

  • + +
  • The top IPs from this morning, during 7 and 8AM in XMLUI and REST/OAI:

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E "31/Jan/2018:(07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -     67 66.249.66.70
    -     70 207.46.13.12
    -     71 197.210.168.174
    -     83 207.46.13.13
    -     85 157.55.39.79
    -     89 207.46.13.14
    -    123 68.180.228.157
    -    198 66.249.66.90
    -    219 41.204.190.40
    -    255 2405:204:a208:1e12:132:2a8e:ad28:46c0
    + 67 66.249.66.70
    + 70 207.46.13.12
    + 71 197.210.168.174
    + 83 207.46.13.13
    + 85 157.55.39.79
    + 89 207.46.13.14
    +123 68.180.228.157
    +198 66.249.66.90
    +219 41.204.190.40
    +255 2405:204:a208:1e12:132:2a8e:ad28:46c0
     # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "31/Jan/2018:(07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -      2 65.55.210.187
    -      2 66.249.66.90
    -      3 157.55.39.79
    -      4 197.232.39.92
    -      4 34.216.252.127
    -      6 104.196.152.243
    -      6 213.55.85.89
    -     15 122.52.115.13
    -     16 213.55.107.186
    -    596 45.5.184.196
    -
    +  2 65.55.210.187
    +  2 66.249.66.90
    +  3 157.55.39.79
    +  4 197.232.39.92
    +  4 34.216.252.127
    +  6 104.196.152.243
    +  6 213.55.85.89
    + 15 122.52.115.13
    + 16 213.55.107.186
    +596 45.5.184.196
    +
  • -

    Tomcat threads

    @@ -1748,15 +1701,14 @@ javax.ws.rs.WebApplicationException + +
  • Ok, so this is interesting: I figured out how to get the MBean path to query Tomcat’s activeSessions from JMX (using munin-plugins-java):

    # port=5400 ip="127.0.0.1" /usr/bin/java -cp /usr/share/munin/munin-jmx-plugins.jar org.munin.plugin.jmx.Beans Catalina:type=Manager,context=/,host=localhost activeSessions
     Catalina:type=Manager,context=/,host=localhost  activeSessions  8
    -
    +
  • -

    MBeans in JVisualVM

    diff --git a/docs/2018-02/index.html b/docs/2018-02/index.html index 29c7c5a24..0924373c2 100644 --- a/docs/2018-02/index.html +++ b/docs/2018-02/index.html @@ -29,7 +29,7 @@ We don’t need to distinguish between internal and external works, so that Yesterday I figured out how to monitor DSpace sessions using JMX I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu’s munin-plugins-java package and used the stuff I discovered about JMX in 2018-01 "/> - + @@ -121,14 +121,15 @@ I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu’s munin-pl + +
  • Wow, I packaged up the jmx_dspace_sessions stuff in the Ansible infrastructure scripts and deployed it on CGSpace and it totally works:

    # munin-run jmx_dspace_sessions
     v_.value 223
     v_jspui.value 1
     v_oai.value 0
    -
    +
  • +

    2018-02-03

    @@ -136,35 +137,29 @@ v_oai.value 0
  • Bram from Atmire responded about the high load caused by the Solr updater script and said it will be fixed with the updates to DSpace 5.8 compatibility: https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=566
  • We will close that ticket for now and wait for the 5.8 stuff: https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560
  • I finally took a look at the second round of cleanups Peter had sent me for author affiliations in mid January
  • -
  • - + +
  • After trimming whitespace and quickly scanning for encoding errors I applied them on CGSpace:

    $ ./delete-metadata-values.py -i /tmp/2018-02-03-Affiliations-12-deletions.csv -f cg.contributor.affiliation -m 211 -d dspace -u dspace -p 'fuuu'
     $ ./fix-metadata-values.py -i /tmp/2018-02-03-Affiliations-1116-corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p 'fuuu'
    -
    +
  • - +
  • Then I started a full Discovery reindex:

    $ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
     
     real    96m39.823s
     user    14m10.975s
     sys     2m29.088s
    -
    +
  • - +
  • Generate a new list of affiliations for Peter to sort through:

    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
     COPY 3723
    -
    +
  • - +
  • Oh, and it looks like we processed over 3.1 million requests in January, up from 2.9 million in December:

    # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2018"
     3126109
    @@ -172,76 +167,88 @@ COPY 3723
     real    0m23.839s
     user    0m27.225s
     sys     0m1.905s
    -
    +
  • +

    2018-02-05

    +
  • Toying with correcting authors with trailing spaces via PostgreSQL:

    dspace=# update metadatavalue set text_value=REGEXP_REPLACE(text_value, '\s+$' , '') where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^.*?\s+$';
     UPDATE 20
    -
    +
  • - +
  • I tried the TRIM(TRAILING from text_value) function and it said it changed 20 items but the spaces didn’t go away
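
  • For the record, that attempt would have been something like this (a sketch with the same where clause as the regexp version above, not the exact statement I ran):

    dspace=# update metadatavalue set text_value=TRIM(TRAILING FROM text_value) where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^.*?\s+$';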

  • + +
  • This is on a fresh import of the CGSpace database, but when I tried to apply it on CGSpace there were no changes detected. Weird.

  • + +
  • Anyways, Peter wants a new list of authors to clean up, so I exported another CSV:

    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors-2018-02-05.csv with csv;
     COPY 55630
    -
    +
  • +

    2018-02-06

    + +
  • The usage otherwise seemed low for REST/OAI as well as XMLUI in the last hour:

    # date
     Tue Feb  6 09:30:32 UTC 2018
     # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "6/Feb/2018:(08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -      2 223.185.41.40
    -      2 66.249.64.14
    -      2 77.246.52.40
    -      4 157.55.39.82
    -      4 193.205.105.8
    -      5 207.46.13.63
    -      5 207.46.13.64
    -      6 154.68.16.34
    -      7 207.46.13.66
    -   1548 50.116.102.77
    +  2 223.185.41.40
    +  2 66.249.64.14
    +  2 77.246.52.40
    +  4 157.55.39.82
    +  4 193.205.105.8
    +  5 207.46.13.63
    +  5 207.46.13.64
    +  6 154.68.16.34
    +  7 207.46.13.66
    +1548 50.116.102.77
     # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E "6/Feb/2018:(08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -     77 213.55.99.121
    -     86 66.249.64.14
    -    101 104.196.152.243
    -    103 207.46.13.64
    -    118 157.55.39.82
    -    133 207.46.13.66
    -    136 207.46.13.63
    -    156 68.180.228.157
    -    295 197.210.168.174
    -    752 144.76.64.79
    -
    + 77 213.55.99.121
    + 86 66.249.64.14
    +101 104.196.152.243
    +103 207.46.13.64
    +118 157.55.39.82
    +133 207.46.13.66
    +136 207.46.13.63
    +156 68.180.228.157
    +295 197.210.168.174
    +752 144.76.64.79
    +
  • -

    2018-02-07

    @@ -254,8 +261,8 @@ Tue Feb 6 09:30:32 UTC 2018
  • But the old URL is hard coded in DSpace and it doesn’t work anyways, because it currently redirects you to https://pub.orcid.org/v2.0/v1.2
  • So I guess we have to disable that shit once and for all and switch to a controlled vocabulary
  • CGSpace crashed again, this time around Wed Feb 7 11:20:28 UTC 2018
  • -
  • - + +
  • I took a few snapshots of the PostgreSQL activity around that time; the connections were very high at first but reduced on their own as the minutes went on:

    $ psql -c 'select * from pg_stat_activity' > /tmp/pg_stat_activity.txt
     $ grep -c 'PostgreSQL JDBC' /tmp/pg_stat_activity*
    @@ -264,42 +271,36 @@ $ grep -c 'PostgreSQL JDBC' /tmp/pg_stat_activity*
     /tmp/pg_stat_activity3.txt:168
     /tmp/pg_stat_activity4.txt:5
     /tmp/pg_stat_activity5.txt:6
    -
    +
  • - +
  • Interestingly, all of those 751 connections were idle!

    $ grep "PostgreSQL JDBC" /tmp/pg_stat_activity* | grep -c idle
     751
    -
    +
  • -

    DSpace Sessions

    +
  • Indeed it seems like there were over 1800 sessions today around the hours of 10 and 11 AM:

    $ grep -E '^2018-02-07 (10|11)' dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     1828
    -
    +
  • - +
  • CGSpace went down again a few hours later, and now the connections to the dspaceWeb pool are maxed at 250 (the new limit I imposed with the new separate pool scheme)

  • + +
  • What’s interesting is that the DSpace log says the connections are all busy:

    org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-328] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
    -
    +
  • - +
  • … but in PostgreSQL I see them idle or idle in transaction:

    $ psql -c 'select * from pg_stat_activity' | grep -c dspaceWeb
     250
    @@ -307,50 +308,48 @@ $ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c idle
     250
     $ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c "idle in transaction"
     187
    -
    +
  • - +
  • What the fuck, does DSpace think all connections are busy?

  • + +
  • I suspect these are issues with abandoned connections or maybe a leak, so I’m going to try adding the removeAbandoned='true' parameter which is apparently off by default

  • + +
  • I will try testOnReturn='true' too, just to add more validation, because I’m fucking grasping at straws
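
  • On the Tomcat JDBC pool that means adding attributes like these to the pool definition (just a sketch: the resource name, maxActive, and the other attributes of our real pools are omitted or assumed here):

    <Resource name="jdbc/dspace" auth="Container" type="javax.sql.DataSource"
              driverClassName="org.postgresql.Driver"
              url="jdbc:postgresql://localhost:5432/dspace"
              username="dspace" password="fuuu"
              maxActive="250"
              removeAbandoned="true"
              removeAbandonedTimeout="60"
              testOnReturn="true"
              validationQuery="SELECT 1" />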

  • + +
  • Also, WTF, there was a heap space error randomly in catalina.out:

    Wed Feb 07 15:01:54 UTC 2018 | Query:containerItem:91917 AND type:2
     Exception in thread "http-bio-127.0.0.1-8081-exec-58" java.lang.OutOfMemoryError: Java heap space
    -
    +
  • - +
  • I’m trying to find a way to determine what was using all those Tomcat sessions, but parsing the DSpace log is hard because some IPs are IPv6, which contain colons!
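
  • One way around the colons would be to just broaden the character class so it matches IPv6 as well as IPv4 (a sketch, not what I actually ran):

    $ grep -E '^2018-02-07 (10|11)' dspace.log.2018-02-07 | grep -o -E 'ip_addr=[0-9a-fA-F.:]+' | sort | uniq -c | sort -n | tail -n 20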

  • + +
  • Looking at the first crash this morning around 11, I see these IPv4 addresses making requests around 10 and 11AM:

    $ grep -E '^2018-02-07 (10|11)' dspace.log.2018-02-07 | grep -o -E 'ip_addr=[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' | sort -n | uniq -c | sort -n | tail -n 20
    -     34 ip_addr=46.229.168.67
    -     34 ip_addr=46.229.168.73
    -     37 ip_addr=46.229.168.76
    -     40 ip_addr=34.232.65.41
    -     41 ip_addr=46.229.168.71
    -     44 ip_addr=197.210.168.174
    -     55 ip_addr=181.137.2.214
    -     55 ip_addr=213.55.99.121
    -     58 ip_addr=46.229.168.65
    -     64 ip_addr=66.249.66.91
    -     67 ip_addr=66.249.66.90
    -     71 ip_addr=207.46.13.54
    -     78 ip_addr=130.82.1.40
    -    104 ip_addr=40.77.167.36
    -    151 ip_addr=68.180.228.157
    -    174 ip_addr=207.46.13.135
    -    194 ip_addr=54.83.138.123
    -    198 ip_addr=40.77.167.62
    -    210 ip_addr=207.46.13.71
    -    214 ip_addr=104.196.152.243
    -
    + 34 ip_addr=46.229.168.67
    + 34 ip_addr=46.229.168.73
    + 37 ip_addr=46.229.168.76
    + 40 ip_addr=34.232.65.41
    + 41 ip_addr=46.229.168.71
    + 44 ip_addr=197.210.168.174
    + 55 ip_addr=181.137.2.214
    + 55 ip_addr=213.55.99.121
    + 58 ip_addr=46.229.168.65
    + 64 ip_addr=66.249.66.91
    + 67 ip_addr=66.249.66.90
    + 71 ip_addr=207.46.13.54
    + 78 ip_addr=130.82.1.40
    +104 ip_addr=40.77.167.36
    +151 ip_addr=68.180.228.157
    +174 ip_addr=207.46.13.135
    +194 ip_addr=54.83.138.123
    +198 ip_addr=40.77.167.62
    +210 ip_addr=207.46.13.71
    +214 ip_addr=104.196.152.243
    +
  • - +
  • These IPs made thousands of sessions today:

    $ grep 104.196.152.243 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     530
    @@ -373,10 +372,9 @@ $ grep 207.46.13.54 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}'
     $ grep 46.229.168 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     992
     
    -
    +
  • - + +
  • Nice, so these are all known bots that are already crammed into one session by Tomcat’s Crawler Session Manager Valve.

  • + +
  • What in the actual fuck, why is our load doing this? It’s gotta be something fucked up with the database pool being “busy” but everything is fucking idle

  • + +
  • One that I should probably add in nginx is 54.83.138.123, which is apparently the following user agent:

    BUbiNG (+http://law.di.unimi.it/BUbiNG.html)
    -
    +
  • - +
  • This one makes two thousand requests per day or so recently:

    # grep -c BUbiNG /var/log/nginx/access.log /var/log/nginx/access.log.1
     /var/log/nginx/access.log:1925
     /var/log/nginx/access.log.1:2029
    -
    +
  • - +
  • And they have 30 IPs, so fuck that shit I’m going to add them to the Tomcat Crawler Session Manager Valve nowwww
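
  • The Valve is configured in Tomcat's server.xml, and adding BUbiNG to its regex looks roughly like this (a sketch: the first three patterns are just the Valve's defaults, and our real crawlerUserAgents list has more bots in it):

    <Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
           crawlerUserAgents=".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*BUbiNG.*" />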

  • + +
  • Lots of discussions on the dspace-tech mailing list over the last few years about leaky transactions being a known problem with DSpace

  • + +
  • Helix84 recommends restarting PostgreSQL instead of Tomcat because it restarts quicker

  • + +
  • This is how the connections looked when it crashed this afternoon:

    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
    -      5 dspaceApi
    -    290 dspaceWeb
    -
    +  5 dspaceApi
    +290 dspaceWeb
    +
  • - +
  • This is how it is right now:

    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
    -      5 dspaceApi
    -      5 dspaceWeb
    -
    +  5 dspaceApi
    +  5 dspaceWeb
    +
  • -

    2018-02-10

    @@ -440,22 +440,18 @@ $ grep 46.229.168 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' |
  • I tried to disable ORCID lookups but keep the existing authorities
  • This item has an ORCID for Ralf Kiese: http://localhost:8080/handle/10568/89897
  • Switch authority.controlled off and change authorLookup to lookup, and the ORCID badge doesn’t show up on the item
  • -
  • - + +
  • Leave all settings but change choices.presentation to lookup and ORCID badge is there and item submission uses LC Name Authority and it breaks with this error:

    Field dc_contributor_author has choice presentation of type "select", it may NOT be authority-controlled.
    -
    +
  • - +
  • If I change choices.presentation to suggest it gives this error:

    xmlui.mirage2.forms.instancedCompositeFields.noSuggestionError
    -
    +
  • -

    2018-02-11

    @@ -467,79 +463,83 @@ $ grep 46.229.168 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' |

    Weird thumbnail

    +
  • I downloaded the PDF and manually generated a thumbnail with ImageMagick and it looked better:

    $ convert CCAFS_WP_223.pdf\[0\] -profile /usr/local/share/ghostscript/9.22/iccprofiles/default_cmyk.icc -thumbnail 600x600 -flatten -profile /usr/local/share/ghostscript/9.22/iccprofiles/default_rgb.icc CCAFS_WP_223.jpg
    -
    +
  • +

    Manual thumbnail

    +
  • Peter sent me corrected author names last week but the file encoding is messed up:

    $ isutf8 authors-2018-02-05.csv
     authors-2018-02-05.csv: line 100, char 18, byte 4179: After a first byte between E1 and EC, expecting the 2nd byte between 80 and BF.
    -
    +
  • - +
  • The isutf8 program comes from moreutils

  • + +
  • Line 100 contains: Galiè, Alessandra

  • + +
  • In other news, psycopg2 is splitting their package in pip, so to install the binary wheel distribution you need to use pip install psycopg2-binary

  • + +
  • See: http://initd.org/psycopg/articles/2018/02/08/psycopg-274-released/

  • + +
  • I updated my fix-metadata-values.py and delete-metadata-values.py scripts on the scripts page: https://github.com/ilri/DSpace/wiki/Scripts

  • + +
  • I ran the 342 author corrections (after trimming whitespace and excluding those with || and other syntax errors) on CGSpace:

    $ ./fix-metadata-values.py -i Correct-342-Authors-2018-02-11.csv -f dc.contributor.author -t correct -m 3 -d dspace -u dspace -p 'fuuu'
    -
    +
  • - +
  • Then I ran a full Discovery re-indexing:

    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
     $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
    -
    +
  • - +
  • That reminds me that Bizu had asked me to fix some of Alan Duncan’s names in December

  • + +
  • I see he actually has some variations with “Duncan, Alan J.”: https://cgspace.cgiar.org/discover?filtertype_1=author&filter_relational_operator_1=contains&filter_1=Duncan%2C+Alan&submit_apply_filter=&query=

  • + +
  • I will just update those for her too and then restart the indexing:

    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Duncan, Alan%';
    -   text_value    |              authority               | confidence 
    +text_value    |              authority               | confidence 
     -----------------+--------------------------------------+------------
    - Duncan, Alan J. | 5ff35043-942e-4d0a-b377-4daed6e3c1a3 |        600
    - Duncan, Alan J. | 62298c84-4d9d-4b83-a932-4a9dd4046db7 |         -1
    - Duncan, Alan J. |                                      |         -1
    - Duncan, Alan    | a6486522-b08a-4f7a-84f9-3a73ce56034d |        600
    - Duncan, Alan J. | cd0e03bf-92c3-475f-9589-60c5b042ea60 |         -1
    - Duncan, Alan J. | a6486522-b08a-4f7a-84f9-3a73ce56034d |         -1
    - Duncan, Alan J. | 5ff35043-942e-4d0a-b377-4daed6e3c1a3 |         -1
    - Duncan, Alan J. | a6486522-b08a-4f7a-84f9-3a73ce56034d |        600
    +Duncan, Alan J. | 5ff35043-942e-4d0a-b377-4daed6e3c1a3 |        600
    +Duncan, Alan J. | 62298c84-4d9d-4b83-a932-4a9dd4046db7 |         -1
    +Duncan, Alan J. |                                      |         -1
    +Duncan, Alan    | a6486522-b08a-4f7a-84f9-3a73ce56034d |        600
    +Duncan, Alan J. | cd0e03bf-92c3-475f-9589-60c5b042ea60 |         -1
    +Duncan, Alan J. | a6486522-b08a-4f7a-84f9-3a73ce56034d |         -1
    +Duncan, Alan J. | 5ff35043-942e-4d0a-b377-4daed6e3c1a3 |         -1
    +Duncan, Alan J. | a6486522-b08a-4f7a-84f9-3a73ce56034d |        600
     (8 rows)
     
     dspace=# begin;
     dspace=# update metadatavalue set text_value='Duncan, Alan', authority='a6486522-b08a-4f7a-84f9-3a73ce56034d', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Duncan, Alan%';
     UPDATE 216
     dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Duncan, Alan%';
    -  text_value  |              authority               | confidence 
    +text_value  |              authority               | confidence 
     --------------+--------------------------------------+------------
    - Duncan, Alan | a6486522-b08a-4f7a-84f9-3a73ce56034d |        600
    +Duncan, Alan | a6486522-b08a-4f7a-84f9-3a73ce56034d |        600
     (1 row)
     dspace=# commit;
    -
    +
  • -

    2018-02-12

    @@ -557,36 +557,37 @@ dspace=# commit; + +
  • So for Abenet, I can check her submissions in December, 2017 with:

    dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^Submitted.*yabowork.*2017-12.*';
    -
    +
  • -

    2018-02-13

    + +
  • I looked in the dspace.log.2018-02-13 and saw one recent one:

    2018-02-13 12:50:13,656 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error - 
     org.postgresql.util.PSQLException: An I/O error occurred while sending to the backend.
     ...
     Caused by: java.net.SocketException: Socket closed
    -
    +
  • - +
  • Could be because of the removeAbandoned="true" that I enabled in the JDBC connection pool last week?

    $ grep -c "java.net.SocketException: Socket closed" dspace.log.2018-02-*
     dspace.log.2018-02-01:0
    @@ -602,17 +603,18 @@ dspace.log.2018-02-10:0
     dspace.log.2018-02-11:3
     dspace.log.2018-02-12:0
     dspace.log.2018-02-13:4
    -
    +
  • - +
  • I apparently added that on 2018-02-07 so it could be, as I don’t see any of those socket closed errors in 2018-01’s logs!

  • + +
  • I will increase the removeAbandonedTimeout from its default of 60 to 90 and enable logAbandoned

  • + +
  • Peter hit this issue one more time, and this is apparently what Tomcat’s catalina.out log says when an abandoned connection is removed:

    Feb 13, 2018 2:05:42 PM org.apache.tomcat.jdbc.pool.ConnectionPool abandon
     WARNING: Connection has been abandoned PooledConnection[org.postgresql.jdbc.PgConnection@22e107be]:java.lang.Exception
    -
    +
  • +

    2018-02-14

    @@ -629,64 +631,59 @@ WARNING: Connection has been abandoned PooledConnection[org.postgresql.jdbc.PgCo
  • Alan S. Orth (0000-0002-1735-7458)
  • Atmire responded on the DSpace 5.8 compatibility ticket and said they will let me know if they want me to give them a clean 5.8 branch
  • -
  • - + +
  • I formatted my list of ORCID IDs as a controlled vocabulary, sorted alphabetically, then ran through XML tidy:

    $ sort cgspace-orcids.txt > dspace/config/controlled-vocabularies/cg-creator-id.xml
     $ add XML formatting...
     $ tidy -xml -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
    -
    +
  • - +
  • It seems the tidy fucks up accents, for example it turns Adriana Tofiño (0000-0001-7115-7169) into Adriana TofiÃ±o (0000-0001-7115-7169)

  • + +
  • We need to force UTF-8:

    $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
    -
    +
  • - +
  • This preserves special accent characters

  • + +
  • I tested the display and store of these in the XMLUI and PostgreSQL and it looks good

  • + +
  • Sisay exported all ILRI, CIAT, etc authors from ORCID and sent a list of 600+

  • + +
  • Peter combined it with mine and we have 1204 unique ORCIDs!

    $ grep -coE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' CGcenter_ORCID_ID_combined.csv
     1204
     $ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' CGcenter_ORCID_ID_combined.csv | sort | uniq | wc -l
     1204
    -
    +
  • - +
  • Also, save that regex for the future because it will be very useful!

  • + +
  • CIAT sent a list of their authors’ ORCIDs and combined with ours there are now 1227:

    $ cat CGcenter_ORCID_ID_combined.csv ciat-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
     1227
    -
    +
  • - +
  • There are some formatting issues with names in Peter’s list, so I should remember to re-generate the list of names from ORCID’s API once we’re done

  • -

  • The dspace cleanup -v currently fails on CGSpace with the following:

     - Deleting bitstream record from database (ID: 149473)
     Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
    -  Detail: Key (bitstream_id)=(149473) is still referenced from table "bundle".
    -
    +Detail: Key (bitstream_id)=(149473) is still referenced from table "bundle".
    +
  • - +
  • The solution is to update the bitstream table, as I’ve discovered several other times in 2016 and 2017:

    $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (149473);'
     UPDATE 1
    -
    +
  • -

    2018-02-15

    @@ -703,61 +700,57 @@ UPDATE 1
  • Send emails to Macaroni Bros and Usman at CIFOR about ORCID metadata
  • CGSpace crashed while I was driving to Tel Aviv, and was down for four hours!
  • I only looked quickly in the logs but saw a bunch of database errors
  • -
  • - + +
  • PostgreSQL connections are currently:

    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | uniq -c
    -      2 dspaceApi
    -      1 dspaceWeb
    -      3 dspaceApi
    -
    +  2 dspaceApi
    +  1 dspaceWeb
    +  3 dspaceApi
    +
  • - +
  • I see shitloads of memory errors in Tomcat’s logs:

    # grep -c "Java heap space" /var/log/tomcat7/catalina.out
     56
    -
    +
  • - +
  • And shit tons of database connections abandoned:

    # grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' /var/log/tomcat7/catalina.out
     612
    -
    +
  • - +
  • I have no fucking idea why it crashed

  • + +
  • The XMLUI activity looks like:

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E "15/Feb/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -    715 63.143.42.244
    -    746 213.55.99.121
    -    886 68.180.228.157
    -    967 66.249.66.90
    -   1013 216.244.66.245
    -   1177 197.210.168.174
    -   1419 207.46.13.159
    -   1512 207.46.13.59
    -   1554 207.46.13.157
    -   2018 104.196.152.243
    -
    +715 63.143.42.244
    +746 213.55.99.121
    +886 68.180.228.157
    +967 66.249.66.90
    +1013 216.244.66.245
    +1177 197.210.168.174
    +1419 207.46.13.159
    +1512 207.46.13.59
    +1554 207.46.13.157
    +2018 104.196.152.243
    +
  • +

    2018-02-17

    + +
  • I should remember to update existing values in PostgreSQL too:

    dspace=# update metadatavalue set text_value='United States Agency for International Development' where resource_type_id=2 and metadata_field_id=29 and text_value like '%U.S. Agency for International Development%';
     UPDATE 2
    -
    +
  • +

    2018-02-18

    @@ -772,63 +765,61 @@ UPDATE 2
  • The one on the bottom left uses a similar format to our author display, and the one in the middle uses the format recommended by ORCID’s branding guidelines
  • Also, I realized that the Academicons font icon set we’re using includes an ORCID badge so we don’t need to use the PNG image anymore
  • Run system updates on DSpace Test (linode02) and reboot the server
  • -
  • - + +
  • Looking back at the system errors on 2018-02-15, I wonder what the fuck caused this:

    $ wc -l dspace.log.2018-02-1{0..8}
    -   383483 dspace.log.2018-02-10
    -   275022 dspace.log.2018-02-11
    -   249557 dspace.log.2018-02-12
    -   280142 dspace.log.2018-02-13
    -   615119 dspace.log.2018-02-14
    -  4388259 dspace.log.2018-02-15
    -   243496 dspace.log.2018-02-16
    -   209186 dspace.log.2018-02-17
    -   167432 dspace.log.2018-02-18
    -
    +383483 dspace.log.2018-02-10
    +275022 dspace.log.2018-02-11
    +249557 dspace.log.2018-02-12
    +280142 dspace.log.2018-02-13
    +615119 dspace.log.2018-02-14
    +4388259 dspace.log.2018-02-15
    +243496 dspace.log.2018-02-16
    +209186 dspace.log.2018-02-17
    +167432 dspace.log.2018-02-18
    +
  • - +
  • From an average of a few hundred thousand to over four million lines in DSpace log?

  • + +
  • Using grep’s -B1 I can see the line before the heap space error, which has the time, ie:

    2018-02-15 16:02:12,748 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
     org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.OutOfMemoryError: Java heap space
    -
    +
  • - +
  • So these errors happened at hours 16, 18, 19, and 20

  • + +
  • Let’s see what was going on in nginx then:

    # zcat --force /var/log/nginx/*.log.{3,4}.gz | wc -l
     168571
     # zcat --force /var/log/nginx/*.log.{3,4}.gz | grep -E "15/Feb/2018:(16|18|19|20)" | wc -l
     8188
    -
    +
  • - +
  • Only 8,000 requests during those four hours, out of 170,000 the whole day!

  • + +
  • And the usage of XMLUI, REST, and OAI looks SUPER boring:

    # zcat --force /var/log/nginx/*.log.{3,4}.gz | grep -E "15/Feb/2018:(16|18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -    111 95.108.181.88
    -    158 45.5.184.221
    -    201 104.196.152.243
    -    205 68.180.228.157
    -    236 40.77.167.131 
    -    253 207.46.13.159
    -    293 207.46.13.59
    -    296 63.143.42.242
    -    303 207.46.13.157
    -    416 63.143.42.244
    -
    +111 95.108.181.88
    +158 45.5.184.221
    +201 104.196.152.243
    +205 68.180.228.157
    +236 40.77.167.131
    +253 207.46.13.159
    +293 207.46.13.59
    +296 63.143.42.242
    +303 207.46.13.157
    +416 63.143.42.244
    +
  • -

    PostgreSQL locks

    @@ -841,58 +832,53 @@ org.springframework.web.util.NestedServletException: Handler processing failed;

    2018-02-19

    +
  • Combined list of CGIAR author ORCID iDs is up to 1,500:

    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml ORCID_ID_CIAT_IITA_IWMI-csv.csv CGcenter_ORCID_ID_combined.csv | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l  
     1571
    -
    +
  • - +
  • I updated my resolve-orcids-from-solr.py script to be able to resolve ORCID identifiers from a text file so I renamed it to resolve-orcids.py

  • + +
  • Also, I updated it so it uses several new options:

    $ ./resolve-orcids.py -i input.txt -o output.txt
     $ cat output.txt 
     Ali Ramadhan: 0000-0001-5019-1368
     Ahmad Maryudi: 0000-0001-5051-7217
    -
    +
  • - +
  • I was running this on the new list of 1571 and found an error:

    Looking up the name associated with ORCID iD: 0000-0001-9634-1958
     Traceback (most recent call last):
    -  File "./resolve-orcids.py", line 111, in <module>
    -    read_identifiers_from_file()
    -  File "./resolve-orcids.py", line 37, in read_identifiers_from_file
    -    resolve_orcid_identifiers(orcids)
    -  File "./resolve-orcids.py", line 65, in resolve_orcid_identifiers
    -    family_name = data['name']['family-name']['value']
    +File "./resolve-orcids.py", line 111, in <module>
    +read_identifiers_from_file()
    +File "./resolve-orcids.py", line 37, in read_identifiers_from_file
    +resolve_orcid_identifiers(orcids)
    +File "./resolve-orcids.py", line 65, in resolve_orcid_identifiers
    +family_name = data['name']['family-name']['value']
     TypeError: 'NoneType' object is not subscriptable
    -
    +
  • - +
  • According to ORCID that identifier’s family-name is null so that sucks

  • + +
  • I fixed the script so that it checks if the family name is null

  • + +
  • Now another:

    Looking up the name associated with ORCID iD: 0000-0002-1300-3636
     Traceback (most recent call last):
    -  File "./resolve-orcids.py", line 117, in <module>
    -    read_identifiers_from_file()
    -  File "./resolve-orcids.py", line 37, in read_identifiers_from_file
    -    resolve_orcid_identifiers(orcids)
    -  File "./resolve-orcids.py", line 65, in resolve_orcid_identifiers
    -    if data['name']['given-names']:
    +File "./resolve-orcids.py", line 117, in <module>
    +read_identifiers_from_file()
    +File "./resolve-orcids.py", line 37, in read_identifiers_from_file
    +resolve_orcid_identifiers(orcids)
    +File "./resolve-orcids.py", line 65, in resolve_orcid_identifiers
    +if data['name']['given-names']:
     TypeError: 'NoneType' object is not subscriptable
    -
    +
  • -

    2018-02-20

    @@ -900,17 +886,17 @@ TypeError: 'NoneType' object is not subscriptable + +
  • This should be the version we use (the existing controlled vocabulary generated from CGSpace’s Solr authority core plus the IDs sent to us so far by partners):

    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml ORCID_ID_CIAT_IITA_IWMI.csv | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > 2018-02-20-combined.txt
    -
    +
  • - +
  • I updated the resolve-orcids.py to use the “credit-name” if it exists in a profile, falling back to “given-names” + “family-name”
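
  • The logic is basically this (a simplified sketch of that part of the script, not the actual code, assuming data is the parsed JSON of an ORCID person record):

    def resolve_name(data):
        # prefer "credit-name" if it is set, otherwise fall back to
        # "given-names" + "family-name", guarding against the null
        # values that crashed the script earlier
        name = data.get('name') or {}
        credit_name = (name.get('credit-name') or {}).get('value')
        if credit_name:
            return credit_name
        given = (name.get('given-names') or {}).get('value') or ''
        family = (name.get('family-name') or {}).get('value') or ''
        return ' '.join(filter(None, [given, family]))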

  • + +
  • Also, I added color-coded output to the debug messages and added a “quiet” mode that suppresses the normal behavior of printing results to the screen

  • + +
  • I’m using this as the test input for resolve-orcids.py:

    $ cat orcid-test-values.txt 
     # valid identifier with 'given-names' and 'family-name'
    @@ -936,19 +922,27 @@ TypeError: 'NoneType' object is not subscriptable
     
     # missing ORCID identifier
     0000-0003-4221-3214
    -
    +
  • -

    2018-02-22

    @@ -956,60 +950,55 @@ TypeError: 'NoneType' object is not subscriptable + +
  • There was absolutely nothing interesting going on at 13:00 on the server, WTF?

    # cat /var/log/nginx/*.log | grep -E "22/Feb/2018:13" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -     55 192.99.39.235
    -     60 207.46.13.26
    -     62 40.77.167.38
    -     65 207.46.13.23
    -    103 41.57.108.208
    -    120 104.196.152.243
    -    133 104.154.216.0
    -    145 68.180.228.117
    -    159 54.92.197.82
    -    231 5.9.6.51
    -
    + 55 192.99.39.235
    + 60 207.46.13.26
    + 62 40.77.167.38
    + 65 207.46.13.23
    +103 41.57.108.208
    +120 104.196.152.243
    +133 104.154.216.0
    +145 68.180.228.117
    +159 54.92.197.82
    +231 5.9.6.51
    +
  • - +
  • Otherwise there was pretty normal traffic the rest of the day:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "22/Feb/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -    839 216.244.66.245
    -   1074 68.180.228.117
    -   1114 157.55.39.100
    -   1162 207.46.13.26
    -   1178 207.46.13.23
    -   2749 104.196.152.243
    -   3109 50.116.102.77
    -   4199 70.32.83.92
    -   5208 5.9.6.51
    -   8686 45.5.184.196
    -
    +839 216.244.66.245
    +1074 68.180.228.117
    +1114 157.55.39.100
    +1162 207.46.13.26
    +1178 207.46.13.23
    +2749 104.196.152.243
    +3109 50.116.102.77
    +4199 70.32.83.92
    +5208 5.9.6.51
    +8686 45.5.184.196
    +
  • - +
  • So I don’t see any definite cause for this crash, but I do see a shit ton of abandoned PostgreSQL connections today around 1PM!

    # grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' /var/log/tomcat7/catalina.out
     729
     # grep 'Feb 22, 2018 1' /var/log/tomcat7/catalina.out | grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' 
     519
    -
    +
  • - +
  • I think the removeAbandonedTimeout might still be too low (I increased it from 60 to 90 seconds last week)

  • + +
  • Abandoned connections are not a cause but a symptom, though perhaps a timeout more like a few minutes is better?

  • + +
  • Also, while looking at the logs I see some new bot:

    Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.4.2661.102 Safari/537.36; 360Spider
    -
    +
  • -

    2018-02-23

    @@ -1022,74 +1011,74 @@ TypeError: 'NoneType' object is not subscriptable + +
  • We currently have 988 unique identifiers:

    $ cat dspace/config/controlled-vocabularies/cg-creator-id.xml | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l          
     988
    -
    +
  • - +
  • After adding the ones from CCAFS we now have 1004:

    $ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/ccafs | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
     1004
    -
    +
  • - +
  • I will add them to DSpace Test but Abenet says she’s still waiting to send us ILRI’s list

  • + +
  • I will tell her that we should proceed on sharing our work on DSpace Test with the partners this week anyways and we can update the list later

  • + +
  • While regenerating the names for these ORCID identifiers I saw one that has a weird value for its names:

    Looking up the names associated with ORCID iD: 0000-0002-2614-426X
     Given Names Deactivated Family Name Deactivated: 0000-0002-2614-426X
    -
    +
  • - +
  • I don’t know if the user accidentally entered this as their name or if that’s how ORCID behaves when the name is private?

  • + +
  • I will remove that one from our list for now

  • + +
  • Remove Dryland Systems subject from submission form because that CRP closed two years ago (#355)

  • + +
  • Run all system updates on DSpace Test

  • + +
  • Email ICT to ask how to proceed with the OCS proforma issue for the new DSpace Test server on Linode

  • + +
  • Thinking about how to preserve ORCID identifiers attached to existing items in CGSpace

  • + +
  • We have over 60,000 unique author + authority combinations on CGSpace:

    dspace=# select count(distinct (text_value, authority)) from metadatavalue where resource_type_id=2 and metadata_field_id=3;
    - count 
    +count 
     -------
    - 62464
    +62464
     (1 row)
    -
    +
  • - +
  • I know from earlier this month that there are only 624 unique ORCID identifiers in the Solr authority core, so it’s way easier to just fetch the unique ORCID iDs from Solr and then go back to PostgreSQL and do the metadata mapping that way

  • + +
  • The query in Solr would simply be orcid_id:*
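
  • For example, something like this against the authority core (a sketch: the core name, field list, and row count here are assumptions):

    $ curl 'http://localhost:8081/solr/authority/select?q=orcid_id:*&fl=id,orcid_id&wt=json&rows=10000'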

  • + +
  • Assuming I know that authority record with id:d7ef744b-bbd4-4171-b449-00e37e1b776f, then I could query PostgreSQL for all metadata records using that authority:

    dspace=# select * from metadatavalue where resource_type_id=2 and authority='d7ef744b-bbd4-4171-b449-00e37e1b776f';
    - metadata_value_id | resource_id | metadata_field_id |        text_value         | text_lang | place |              authority               | confidence | resource_type_id 
    +metadata_value_id | resource_id | metadata_field_id |        text_value         | text_lang | place |              authority               | confidence | resource_type_id 
     -------------------+-------------+-------------------+---------------------------+-----------+-------+--------------------------------------+------------+------------------
    -           2726830 |       77710 |                 3 | Rodríguez Chalarca, Jairo |           |     2 | d7ef744b-bbd4-4171-b449-00e37e1b776f |        600 |                2
    +       2726830 |       77710 |                 3 | Rodríguez Chalarca, Jairo |           |     2 | d7ef744b-bbd4-4171-b449-00e37e1b776f |        600 |                2
     (1 row)
    -
    +
  • - +
  • Then I suppose I can use the resource_id to identify the item?

  • + +
  • Actually, resource_id is the same id we use in CSV, so I could simply build something like this for a metadata import!

    id,cg.creator.id
     93848,Alan S. Orth: 0000-0002-1735-7458||Peter G. Ballantyne: 0000-0001-9346-2893
    -
    +
  • - +
  • I just discovered that requests-cache can transparently cache HTTP requests
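
  • It is basically a one-line change in a script (a minimal sketch; the cache name is arbitrary and the ORCID endpoint is just an example):

    import requests
    import requests_cache

    # subsequent requests.get() calls are cached transparently in a local SQLite file
    requests_cache.install_cache('orcid-cache')

    r = requests.get('https://pub.orcid.org/v2.0/0000-0002-1735-7458/person',
                     headers={'Accept': 'application/json'})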

  • + +
  • Running resolve-orcids.py with my test input takes 10.5 seconds the first time, and then 3.0 seconds the second time!

    $ time ./resolve-orcids.py -i orcid-test-values.txt -o /tmp/orcid-names
     Ali Ramadhan: 0000-0001-5019-1368
    @@ -1103,7 +1092,8 @@ Alan S. Orth: 0000-0002-1735-7458
     Ibrahim Mohammed: 0000-0001-5199-5528
     Nor Azwadi: 0000-0001-9634-1958
     ./resolve-orcids.py -i orcid-test-values.txt -o /tmp/orcid-names  0.23s user 0.05s system 8% cpu 3.046 total
    -
    +
  • +

    2018-02-26

    @@ -1122,87 +1112,82 @@ Nor Azwadi: 0000-0001-9634-1958
  • I have disabled removeAbandoned for now because that’s the only thing I changed in the last few weeks since he started having issues
  • I think the real line of logic to follow here is why the submissions page is so slow for him (presumably because of loading all his submissions?)
  • I need to see which SQL queries are run during that time
  • -
  • - + +
  • And only a few hours after I disabled the removeAbandoned thing CGSpace went down and lo and behold, there were 264 connections, most of which were idle:

    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
    -      5 dspaceApi
    -    279 dspaceWeb
    +  5 dspaceApi
    +279 dspaceWeb
     $ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c "idle in transaction"
     218
    -
    +
  • - +
  • So I’m re-enabling the removeAbandoned setting

  • + +
  • I grabbed a snapshot of the active connections in pg_stat_activity for all queries running longer than 2 minutes:

    dspace=# \copy (SELECT now() - query_start as "runtime", application_name, usename, datname, waiting, state, query
    -  FROM  pg_stat_activity
    -  WHERE now() - query_start > '2 minutes'::interval
    - ORDER BY runtime DESC) to /tmp/2018-02-27-postgresql.txt
    +FROM  pg_stat_activity
    +WHERE now() - query_start > '2 minutes'::interval
    +ORDER BY runtime DESC) to /tmp/2018-02-27-postgresql.txt
     COPY 263
    -
    +
  • - +
  • 100 of these idle in transaction connections are the following query:

    SELECT * FROM resourcepolicy WHERE resource_type_id= $1 AND resource_id= $2 AND action_id= $3
    -
    +
  • - +
  • … but according to the pg_locks documentation I should have done this to correlate the locks with the activity:

    SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;
    -
    +
  • -

    2018-02-28

    + +
  • There’s nothing interesting going on in nginx’s logs around that time:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "28/Feb/2018:09:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -     65 197.210.168.174
    -     74 213.55.99.121
    -     74 66.249.66.90
    -     86 41.204.190.40
    -    102 130.225.98.207
    -    108 192.0.89.192
    -    112 157.55.39.218
    -    129 207.46.13.21
    -    131 207.46.13.115
    -    135 207.46.13.101
    -
    + 65 197.210.168.174
    + 74 213.55.99.121
    + 74 66.249.66.90
    + 86 41.204.190.40
    +102 130.225.98.207
    +108 192.0.89.192
    +112 157.55.39.218
    +129 207.46.13.21
    +131 207.46.13.115
    +135 207.46.13.101
    +
  • - +
  • Looking in dspace.log-2018-02-28 I see this, though:

    2018-02-28 09:19:29,692 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
     org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.OutOfMemoryError: Java heap space
    -
    +
  • - +
  • Memory issues seem to be common this month:

    $ grep -c 'nested exception is java.lang.OutOfMemoryError: Java heap space' dspace.log.2018-02-* 
     dspace.log.2018-02-01:0
    @@ -1233,51 +1218,53 @@ dspace.log.2018-02-25:0
     dspace.log.2018-02-26:0
     dspace.log.2018-02-27:6
     dspace.log.2018-02-28:1
    -
    +
  • - +
  • Top ten users by session during the first twenty minutes of 9AM:

    $ grep -E '2018-02-28 09:(0|1)' dspace.log.2018-02-28 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq -c | sort -n | tail -n 10
    -     18 session_id=F2DFF64D3D707CD66AE3A873CEC80C49
    -     19 session_id=92E61C64A79F0812BE62A3882DA8F4BA
    -     21 session_id=57417F5CB2F9E3871E609CEEBF4E001F
    -     25 session_id=C3CD265AB7AA51A49606C57C069A902A
    -     26 session_id=E395549F081BA3D7A80F174AE6528750
    -     26 session_id=FEE38CF9760E787754E4480069F11CEC
    -     33 session_id=C45C2359AE5CD115FABE997179E35257
    -     38 session_id=1E9834E918A550C5CD480076BC1B73A4
    -     40 session_id=8100883DAD00666A655AE8EC571C95AE
    -     66 session_id=01D9932D6E85E90C2BA9FF5563A76D03
    -
    + 18 session_id=F2DFF64D3D707CD66AE3A873CEC80C49
    + 19 session_id=92E61C64A79F0812BE62A3882DA8F4BA
    + 21 session_id=57417F5CB2F9E3871E609CEEBF4E001F
    + 25 session_id=C3CD265AB7AA51A49606C57C069A902A
    + 26 session_id=E395549F081BA3D7A80F174AE6528750
    + 26 session_id=FEE38CF9760E787754E4480069F11CEC
    + 33 session_id=C45C2359AE5CD115FABE997179E35257
    + 38 session_id=1E9834E918A550C5CD480076BC1B73A4
    + 40 session_id=8100883DAD00666A655AE8EC571C95AE
    + 66 session_id=01D9932D6E85E90C2BA9FF5563A76D03
    +
  • - +
  • According to the log 01D9932D6E85E90C2BA9FF5563A76D03 is an ILRI editor, doing lots of updating and editing of items

  • + +
  • 8100883DAD00666A655AE8EC571C95AE is some Indian IP address

  • + +
  • 1E9834E918A550C5CD480076BC1B73A4 looks to be a session shared by the bots

  • + +
  • So maybe it was due to the editor’s uploading of files, perhaps something that was too big?

  • + +
  • I think I’ll increase the JVM heap size on CGSpace from 6144m to 8192m because I’m sick of this random crashing shit and the server has memory and I’d rather eliminate this so I can get back to solving PostgreSQL issues and doing other real work
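
  • The heap size is just part of Tomcat's JAVA_OPTS, so the change is something like this (a sketch: the file location and the rest of our real JVM options are omitted or assumed):

    # e.g. in /etc/default/tomcat7, or wherever the Ansible templates set JAVA_OPTS
    JAVA_OPTS="-Xms8192m -Xmx8192m -Dfile.encoding=UTF-8"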

  • + +
  • Run the few corrections from earlier this month for sponsor on CGSpace:

    cgspace=# update metadatavalue set text_value='United States Agency for International Development' where resource_type_id=2 and metadata_field_id=29 and text_value like '%U.S. Agency for International Development%';
     UPDATE 3
    -
    +
  • - +
  • I finally got a CGIAR account so I logged into CGSpace with it and tried to delete my old unfinished submissions (22 of them)

  • + +
  • Eventually it succeeded, but it took about five minutes and I noticed LOTS of locks happening with this query:

    dspace=# \copy (SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid) to /tmp/locks-aorth.txt;
    -
    +
  • - diff --git a/docs/2018-03/index.html b/docs/2018-03/index.html index e7eecc823..739bc78e9 100644 --- a/docs/2018-03/index.html +++ b/docs/2018-03/index.html @@ -23,7 +23,7 @@ Export a CSV of the IITA community metadata for Martin Mueller Export a CSV of the IITA community metadata for Martin Mueller "/> - + @@ -115,34 +115,34 @@ Export a CSV of the IITA community metadata for Martin Mueller
  • Andrea from Macaroni Bros had sent me an email that CCAFS needs them
  • Give Udana more feedback on his WLE records from last month
  • There were some records using a non-breaking space in their AGROVOC subject field
  • -
  • - + +
  • I checked and tested some author corrections from Peter from last week, and then applied them on CGSpace

    $ ./fix-metadata-values.py -i Correct-309-authors-2018-03-06.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3      
     $ ./delete-metadata-values.py -i Delete-3-Authors-2018-03-06.csv -db dspace -u dspace-p 'fuuu' -f dc.contributor.author -m 3
    -
    +
  • - +
  • This time there were no errors in whitespace but I did have to correct one incorrectly encoded accent character

  • + +
  • Add new CRP subject “GRAIN LEGUMES AND DRYLAND CEREALS” to input-forms.xml (#358)

  • + +
  • Merge the ORCID integration stuff in to 5_x-prod for deployment on CGSpace soon (#359)

  • + +
  • Deploy ORCID changes on CGSpace (linode18), run all system updates, and reboot the server

  • + +
  • Run all system updates on DSpace Test and reboot server

  • + +
  • I ran the orcid-authority-to-item.py script on CGSpace and mapped 2,864 ORCID identifiers from Solr to item metadata

    $ ./orcid-authority-to-item.py -db dspace -u dspace -p 'fuuu' -s http://localhost:8081/solr -d
    -
    +
  • - +
  • I ran the DSpace cleanup script on CGSpace and it threw an error (as always):

    Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
    -  Detail: Key (bitstream_id)=(150659) is still referenced from table "bundle".
    -
    +Detail: Key (bitstream_id)=(150659) is still referenced from table "bundle".
    +
  • - + +
  • Around that time there were an increase of SQL errors:

    2018-03-19 09:10:54,856 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
     ...
     2018-03-19 09:10:54,862 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL query singleTable Error -
    -
    +
  • - +
  • But these errors, I don’t even know what they mean, because a handful of them happen every day:

    $ grep -c 'ERROR org.dspace.storage.rdbms.DatabaseManager' dspace.log.2018-03-1*
     dspace.log.2018-03-10:13
    @@ -394,103 +388,105 @@ dspace.log.2018-03-16:13
     dspace.log.2018-03-17:13
     dspace.log.2018-03-18:15
     dspace.log.2018-03-19:90
    -
    +
  • - +
  • There wasn’t even a lot of traffic at the time (8–9 AM):

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "19/Mar/2018:0[89]:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -     92 40.77.167.197
    -     92 83.103.94.48
    -     96 40.77.167.175
    -    116 207.46.13.178
    -    122 66.249.66.153
    -    140 95.108.181.88
    -    196 213.55.99.121
    -    206 197.210.168.174
    -    207 104.196.152.243
    -    294 54.198.169.202
    -
    + 92 40.77.167.197
    + 92 83.103.94.48
    + 96 40.77.167.175
    +116 207.46.13.178
    +122 66.249.66.153
    +140 95.108.181.88
    +196 213.55.99.121
    +206 197.210.168.174
    +207 104.196.152.243
    +294 54.198.169.202
    +
  • - +
  • Well there is a hint in Tomcat’s catalina.out:

    Mon Mar 19 09:05:28 UTC 2018 | Query:id: 92032 AND type:2
     Exception in thread "http-bio-127.0.0.1-8081-exec-280" java.lang.OutOfMemoryError: Java heap space
    -
    +
  • -

    2018-03-20

    +
  • DSpace Test has been down for a few hours with SQL and memory errors starting this morning:

    2018-03-20 08:47:10,177 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
     ...
     2018-03-20 08:53:11,624 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
     org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.OutOfMemoryError: Java heap space
    -
    +
  • - +
  • I have no idea why it crashed

  • + +
  • I ran all system updates and rebooted it

  • + +
  • Abenet told me that one of Lance Robinson’s ORCID iDs on CGSpace is incorrect

  • + +
  • I will remove it from the controlled vocabulary (#367) and update any items using the old one:

    dspace=# update metadatavalue set text_value='Lance W. Robinson: 0000-0002-5224-8644' where resource_type_id=2 and metadata_field_id=240 and text_value like '%0000-0002-6344-195X%';
     UPDATE 1
    -
    +
  • - +
  • Communicate with DSpace editors on Yammer about being more careful about spaces and character editing when doing manual metadata edits

  • + +
  • Merge the changes to CRP names to the 5_x-prod branch and deploy on CGSpace (#363)

  • + +
  • Run corrections for CRP names in the database:

    $ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db dspace -u dspace -p 'fuuu'
    -
    +
  • - +
  • Run all system updates on CGSpace (linode18) and reboot the server

  • + +
  • I started a full Discovery re-index on CGSpace because of the updated CRPs

  • + +
  • I see this error in the DSpace log:

    2018-03-20 19:03:14,844 ERROR com.atmire.dspace.discovery.AtmireSolrService @ No choices plugin was configured for  field "dc_contributor_author".
     java.lang.IllegalArgumentException: No choices plugin was configured for  field "dc_contributor_author".
    -        at org.dspace.content.authority.ChoiceAuthorityManager.getLabel(ChoiceAuthorityManager.java:261)
    -        at org.dspace.content.authority.ChoiceAuthorityManager.getLabel(ChoiceAuthorityManager.java:249)
    -        at org.dspace.browse.SolrBrowseCreateDAO.additionalIndex(SolrBrowseCreateDAO.java:215)
    -        at com.atmire.dspace.discovery.AtmireSolrService.buildDocument(AtmireSolrService.java:662)
    -        at com.atmire.dspace.discovery.AtmireSolrService.indexContent(AtmireSolrService.java:807)
    -        at com.atmire.dspace.discovery.AtmireSolrService.updateIndex(AtmireSolrService.java:876)
    -        at org.dspace.discovery.SolrServiceImpl.createIndex(SolrServiceImpl.java:370)
    -        at org.dspace.discovery.IndexClient.main(IndexClient.java:117)
    -        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    -        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    -        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    -        at java.lang.reflect.Method.invoke(Method.java:498)
    -        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
    -        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
    -
    +    at org.dspace.content.authority.ChoiceAuthorityManager.getLabel(ChoiceAuthorityManager.java:261)
    +    at org.dspace.content.authority.ChoiceAuthorityManager.getLabel(ChoiceAuthorityManager.java:249)
    +    at org.dspace.browse.SolrBrowseCreateDAO.additionalIndex(SolrBrowseCreateDAO.java:215)
    +    at com.atmire.dspace.discovery.AtmireSolrService.buildDocument(AtmireSolrService.java:662)
    +    at com.atmire.dspace.discovery.AtmireSolrService.indexContent(AtmireSolrService.java:807)
    +    at com.atmire.dspace.discovery.AtmireSolrService.updateIndex(AtmireSolrService.java:876)
    +    at org.dspace.discovery.SolrServiceImpl.createIndex(SolrServiceImpl.java:370)
    +    at org.dspace.discovery.IndexClient.main(IndexClient.java:117)
    +    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    +    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    +    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    +    at java.lang.reflect.Method.invoke(Method.java:498)
    +    at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
    +    at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
    +
  • -

    2018-03-21

    @@ -516,75 +512,71 @@ COPY 56156 + +
  • CGSpace crashed again this afternoon, I’m not sure of the cause but there are a lot of SQL errors in the DSpace log:

    2018-03-21 15:11:08,166 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error - 
     java.sql.SQLException: Connection has already been closed.
    -
    +
  • - +
  • I have no idea why so many connections were abandoned this afternoon:

    # grep 'Mar 21, 2018' /var/log/tomcat7/catalina.out | grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon'
     268
    -
    +
  • - +
  • DSpace Test crashed again due to Java heap space, this is from the DSpace log:

    2018-03-21 15:18:48,149 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
     org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.OutOfMemoryError: Java heap space
    -
    +
  • - +
  • And this is from the Tomcat Catalina log:

    Mar 21, 2018 11:20:00 AM org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor run
     SEVERE: Unexpected death of background thread ContainerBackgroundProcessor[StandardEngine[Catalina]]
     java.lang.OutOfMemoryError: Java heap space
    -
    +
  • - +
  • But there are tons of heap space errors on DSpace Test actually:

    # grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
     319
    -
    +
  • - +
  • I guess we need to give it more RAM because it now has CGSpace’s large Solr core

  • + +
  • I will increase the memory from 3072m to 4096m

  • + +
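  • In practice that should just mean bumping -Xmx in Tomcat’s JAVA_OPTS, roughly like this (a sketch; the exact flags in our Ansible templates may differ):

    JAVA_OPTS="-Djava.awt.headless=true -Xms4096m -Xmx4096m -Dfile.encoding=UTF-8"
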
  • Update Ansible playbooks to use PostgreSQL JBDC driver 42.2.2

  • + +
  • Deploy the new JDBC driver on DSpace Test

  • + +
  • I’m also curious to see how long the dspace index-discovery -b takes on DSpace Test where the DSpace installation directory is on one of Linode’s new block storage volumes

    $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
     real    208m19.155s
     user    8m39.138s
     sys     2m45.135s
    -
    +
  • - +
  • So that’s about three times as long as it took on CGSpace this morning

  • + +
  • I should also check the raw read speed with hdparm -tT /dev/sdc

  • + +
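  • Something like this would give a rough comparison against the main disk (a sketch; /dev/sda as the root disk is an assumption, only /dev/sdc is mentioned above):

    # hdparm -tT /dev/sda
    # hdparm -tT /dev/sdc
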
  • Looking at Peter’s author corrections there are some mistakes due to Windows 1252 encoding

  • + +
  • I need to find a way to filter these easily with OpenRefine

  • + +
  • For example, Peter has inadvertently introduced Unicode character 0xfffd into several fields

  • + +
  • I can search for Unicode values by their hex code in OpenRefine using the following GREL expression:

    isNotNull(value.match(/.*\ufffd.*/))
    -
    +
  • -

    2018-03-22

    @@ -605,36 +597,31 @@ sys 2m45.135s + +
  • I can find all names that have acceptable characters using a GREL expression like:

    isNotNull(value.match(/.*[a-zA-ZáÁéèïíñØøöóúü].*/))
    -
    +
  • - +
  • But it’s probably better to just say which characters I know for sure are not valid (like parentheses, pipe, or weird Unicode characters):

    or(
    -  isNotNull(value.match(/.*[(|)].*/)),
    -  isNotNull(value.match(/.*\uFFFD.*/)),
    -  isNotNull(value.match(/.*\u00A0.*/)),
    -  isNotNull(value.match(/.*\u200A.*/))
    +isNotNull(value.match(/.*[(|)].*/)),
    +isNotNull(value.match(/.*\uFFFD.*/)),
    +isNotNull(value.match(/.*\u00A0.*/)),
    +isNotNull(value.match(/.*\u200A.*/))
     )
    -
    +
  • - +
  • And here’s one combined GREL expression to check for items marked as to delete or check so I can flag them and export them to a separate CSV (though perhaps it’s time to add delete support to my fix-metadata-values.py script):

    or(
    -  isNotNull(value.match(/.*delete.*/i)),
    -  isNotNull(value.match(/.*remove.*/i)),
    -  isNotNull(value.match(/.*check.*/i))
    +isNotNull(value.match(/.*delete.*/i)),
    +isNotNull(value.match(/.*remove.*/i)),
    +isNotNull(value.match(/.*check.*/i))
     )
    -
    +
  • - +
  • Test the corrections and deletions locally, then run them on CGSpace:

    $ ./fix-metadata-values.py -i /tmp/Correct-2928-Authors-2018-03-21.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
     $ ./delete-metadata-values.py -i /tmp/Delete-8-Authors-2018-03-21.csv -f dc.contributor.author -m 3 -db dspacetest -u dspace -p 'fuuu'
    -
    +
  • -

    2018-03-26

    @@ -674,16 +661,15 @@ $ ./delete-metadata-values.py -i /tmp/Delete-8-Authors-2018-03-21.csv -f dc.cont + +
  • The error in Tomcat’s catalina.out was:

    Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space
    -
    +
  • - +
  • Add ISI Journal (cg.isijournal) as an option in Atmire’s Listing and Reports layout (#370) for Abenet

  • + +
  • I noticed a few hundred CRPs using the old capitalized formatting so I corrected them:

    $ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db cgspace -u cgspace -p 'fuuu'
     Fixed 29 occurences of: CLIMATE CHANGE, AGRICULTURE AND FOOD SECURITY
    @@ -696,14 +682,17 @@ Fixed 11 occurences of: POLICIES, INSTITUTIONS, AND MARKETS
     Fixed 28 occurences of: GRAIN LEGUMES
     Fixed 3 occurences of: FORESTS, TREES AND AGROFORESTRY
     Fixed 5 occurences of: GENEBANKS
    -
    +
  • - diff --git a/docs/2018-04/index.html b/docs/2018-04/index.html index da11df147..4f48f9713 100644 --- a/docs/2018-04/index.html +++ b/docs/2018-04/index.html @@ -25,7 +25,7 @@ Catalina logs at least show some memory errors yesterday: I tried to test something on DSpace Test but noticed that it’s down since god knows when Catalina logs at least show some memory errors yesterday: "/> - + @@ -130,16 +130,14 @@ Exception in thread "ContainerBackgroundProcessor[StandardEngine[Catalina]] + +
  • For completeness I re-ran the CRP corrections on CGSpace:

    $ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db dspace -u dspace -p 'fuuu'
     Fixed 1 occurences of: AGRICULTURE FOR NUTRITION AND HEALTH
    -
    +
  • - +
  • Then started a full Discovery index:

    $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx1024m'
     $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
    @@ -147,18 +145,16 @@ $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
     real    76m13.841s
     user    8m22.960s
     sys     2m2.498s
    -
    +
  • - +
  • Elizabeth from CIAT emailed to ask if I could help her by adding ORCID identifiers to all of Joseph Tohme’s items

  • + +
  • I used my add-orcid-identifiers-csv.py script:

    $ ./add-orcid-identifiers-csv.py -i /tmp/jtohme-2018-04-04.csv -db dspace -u dspace -p 'fuuu'
    -
    +
  • -
    dc.contributor.author,cg.creator.id
    @@ -168,16 +164,15 @@ sys     2m2.498s
     
    +
    +
  • I started preparing the git branch for the DSpace 5.5→5.8 upgrade:

    $ git checkout -b 5_x-dspace-5.8 5_x-prod
     $ git reset --hard ilri/5_x-prod
     $ git rebase -i dspace-5.8
    -
    +
  • -

    2018-04-05

    + +
  • The reindexing process on DSpace Test took forever yesterday:

    $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
     real    599m32.961s
     user    9m3.947s
     sys     2m52.585s
    -
    +
  • - +
  • So we really should not use this Linode block storage for Solr

  • + +
  • Assetstore might be fine but would complicate things with configuration and deployment (ughhh)

  • + +
  • Better to use Linode block storage only for backup

  • + +
  • Help Peter with the GDPR compliance / reporting form for CGSpace

  • + +
  • DSpace Test crashed due to memory issues again:

    # grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
     16
    -
    +
  • -

    2018-04-10

    + +
  • Looking at the nginx logs, here are the top users today so far:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "10/Apr/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10                                                                                                   
    -    282 207.46.13.112
    -    286 54.175.208.220
    -    287 207.46.13.113
    -    298 66.249.66.153
    -    322 207.46.13.114
    -    780 104.196.152.243
    -   3994 178.154.200.38
    -   4295 70.32.83.92
    -   4388 95.108.181.88
    -   7653 45.5.186.2
    -
    +282 207.46.13.112
    +286 54.175.208.220
    +287 207.46.13.113
    +298 66.249.66.153
    +322 207.46.13.114
    +780 104.196.152.243
    +3994 178.154.200.38
    +4295 70.32.83.92
    +4388 95.108.181.88
    +7653 45.5.186.2
    +
  • - +
  • 45.5.186.2 is of course CIAT

  • + +
  • 95.108.181.88 appears to be Yandex:

    95.108.181.88 - - [09/Apr/2018:06:34:16 +0000] "GET /bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg HTTP/1.1" 200 2638 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
    -
    +
  • - +
  • And for some reason Yandex created a lot of Tomcat sessions today:

    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2018-04-10
     4363
    -
    +
  • - +
  • 70.32.83.92 appears to be some harvester we’ve seen before, but on a new IP

  • + +
  • They are not creating new Tomcat sessions so there is no problem there

  • + +
  • 178.154.200.38 also appears to be Yandex, and is also creating many Tomcat sessions:

    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=178.154.200.38' dspace.log.2018-04-10
     3982
    -
    +
  • - +
  • I’m not sure why Yandex creates so many Tomcat sessions, as its user agent should match the Crawler Session Manager valve

  • + +
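  • A quick way to check whether the valve is enabled and whether its crawlerUserAgents regex would cover Yandex (a sketch; the server.xml path and any custom regex are assumptions):

    # grep -A 1 'CrawlerSessionManagerValve' /etc/tomcat7/server.xml
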
  • Let’s try a manual request with and without their user agent:

    $ http --print Hh https://cgspace.cgiar.org/bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg 'User-Agent:Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)'
     GET /bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg HTTP/1.1
    @@ -321,19 +318,19 @@ X-Cocoon-Version: 2.2.0
     X-Content-Type-Options: nosniff
     X-Frame-Options: SAMEORIGIN
     X-XSS-Protection: 1; mode=block
    -
    +
  • -

    Tomcat sessions week

    +
  • In other news, it looks like the number of total requests processed by nginx in March went down from the previous months:

    # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Mar/2018"
     2266594
    @@ -341,85 +338,84 @@ X-XSS-Protection: 1; mode=block
     real    0m13.658s
     user    0m16.533s
     sys     0m1.087s
    -
    +
  • - +
  • In other other news, the database cleanup script has an issue again:

    $ dspace cleanup -v
     ...
     Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
    -  Detail: Key (bitstream_id)=(151626) is still referenced from table "bundle".
    -
    +Detail: Key (bitstream_id)=(151626) is still referenced from table "bundle".
    +
  • - +
  • The solution is, as always:

    $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (151626);'
     UPDATE 1
    -
    +
  • - +
  • Looking at abandoned connections in Tomcat:

    # zcat /var/log/tomcat7/catalina.out.[1-9].gz | grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon'
     2115
    -
    +
  • - +
  • Apparently from these stacktraces we should be able to see which code is not closing connections properly

  • + +
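  • Something like this could summarize which DSpace classes appear most often in those abandoned-connection stack traces (only a sketch; the -A 20 context window is a guess):

    # zcat /var/log/tomcat7/catalina.out.[1-9].gz | grep -A 20 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' | grep -o 'at org\.dspace\.[^(]*' | sort | uniq -c | sort -rn | head
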
  • Here’s a pretty good overview of days where we had database issues recently:

    # zcat /var/log/tomcat7/catalina.out.[1-9].gz | grep 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' | awk '{print $1,$2, $3}' | sort | uniq -c | sort -n
    -      1 Feb 18, 2018
    -      1 Feb 19, 2018
    -      1 Feb 20, 2018
    -      1 Feb 24, 2018
    -      2 Feb 13, 2018
    -      3 Feb 17, 2018
    -      5 Feb 16, 2018
    -      5 Feb 23, 2018
    -      5 Feb 27, 2018
    -      6 Feb 25, 2018
    -     40 Feb 14, 2018
    -     63 Feb 28, 2018
    -    154 Mar 19, 2018
    -    202 Feb 21, 2018
    -    264 Feb 26, 2018
    -    268 Mar 21, 2018
    -    524 Feb 22, 2018
    -    570 Feb 15, 2018
    -
    + 1 Feb 18, 2018
    + 1 Feb 19, 2018
    + 1 Feb 20, 2018
    + 1 Feb 24, 2018
    + 2 Feb 13, 2018
    + 3 Feb 17, 2018
    + 5 Feb 16, 2018
    + 5 Feb 23, 2018
    + 5 Feb 27, 2018
    + 6 Feb 25, 2018
    + 40 Feb 14, 2018
    + 63 Feb 28, 2018
    +154 Mar 19, 2018
    +202 Feb 21, 2018
    +264 Feb 26, 2018
    +268 Mar 21, 2018
    +524 Feb 22, 2018
    +570 Feb 15, 2018
    +
  • -

    2018-04-11

    +
  • DSpace Test (linode19) crashed again some time since yesterday:

    # grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
     168
    -
    +
  • -

    2018-04-12

    @@ -438,35 +434,34 @@ UPDATE 1

    2018-04-15

    +
  • While testing an XMLUI patch for DS-3883 I noticed that there is still some remaining Authority / Solr configuration left that we need to remove:

    2018-04-14 18:55:25,841 ERROR org.dspace.authority.AuthoritySolrServiceImpl @ Authority solr is not correctly configured, check "solr.authority.server" property in the dspace.cfg
     java.lang.NullPointerException
    -
    +
  • - +
  • I assume we need to remove authority from the consumers in dspace/config/dspace.cfg:

    event.dispatcher.default.consumers = authority, versioning, discovery, eperson, harvester, statistics,batchedit, versioningmqm
    -
    +
  • - +
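  • A one-liner like this could make that edit (just a sketch; it assumes the consumers line is formatted exactly as shown above):

    $ sed -i 's/consumers = authority, versioning/consumers = versioning/' dspace/config/dspace.cfg
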
  • I see the same error on DSpace Test so this is definitely a problem

  • + +
  • After disabling the authority consumer I no longer see the error

  • + +
  • I merged a pull request to the 5_x-prod branch to clean that up (#372)

  • + +
  • File a ticket on DSpace’s Jira for the target="_blank" security and performance issue (DS-3891)

  • + +
  • I re-deployed DSpace Test (linode19) and was surprised by how long it took the ant update to complete:

    BUILD SUCCESSFUL
     Total time: 4 minutes 12 seconds
    -
    +
  • -

    2018-04-16

    @@ -481,69 +476,79 @@ Total time: 4 minutes 12 seconds
  • IWMI people are asking about building a search query that outputs RSS for their reports
  • They want the same results as this Discovery query: https://cgspace.cgiar.org/discover?filtertype_1=dateAccessioned&filter_relational_operator_1=contains&filter_1=2018&submit_apply_filter=&query=&scope=10568%2F16814&rpp=100&sort_by=dc.date.issued_dt&order=desc
  • They will need to use OpenSearch, but I can’t remember all the parameters
  • -
  • Apparently search sort options for OpenSearch are in dspace.cfg:
  • - + +
  • Apparently search sort options for OpenSearch are in dspace.cfg:

    webui.itemlist.sort-option.1 = title:dc.title:title
     webui.itemlist.sort-option.2 = dateissued:dc.date.issued:date
     webui.itemlist.sort-option.3 = dateaccessioned:dc.date.accessioned:date
     webui.itemlist.sort-option.4 = type:dc.type:text
    -
    +
  • - +
  • They want items by issue date, so we need to use sort option 2

  • + +
  • According to the DSpace Manual there are only the following parameters to OpenSearch: format, scope, rpp, start, and sort_by

  • + +
  • The OpenSearch query parameter expects a Discovery search filter that is defined in dspace/config/spring/api/discovery.xml

  • + +
  • So for IWMI they should be able to use something like this: https://cgspace.cgiar.org/open-search/discover?query=dateIssued:2018&scope=10568/16814&sort_by=2&order=DESC&format=rss

  • + +
  • There are also rpp (results per page) and start parameters but in my testing now on DSpace 5.5 they behave very strangely

  • + +
  • For example, set rpp=1 and then check the results for start values of 0, 1, and 2 and they are all the same!

  • + +
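  • A quick loop like this should make the odd behavior easy to see (a sketch using the IWMI query above; the grep for feed titles is approximate):

    $ for start in 0 1 2; do http "https://cgspace.cgiar.org/open-search/discover?query=dateIssued:2018&scope=10568/16814&rpp=1&start=$start&format=rss" | grep -oE '<title>[^<]*</title>' | head -n 3; done
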
  • If I have time I will check if this behavior persists on DSpace 6.x on the official DSpace demo and file a bug

  • + +
  • Also, the DSpace Manual as of 5.x has very poor documentation for OpenSearch

  • + +
  • They don’t tell you to use Discovery search filters in the query (with format query=dateIssued:2018)

  • + +
  • They don’t tell you that the sort options are actually defined in dspace.cfg (ie, you need to use 2 instead of dc.date.issued_dt)

  • + +
  • They are missing the order parameter (ASC vs DESC)

  • + +
  • I notice that DSpace Test has crashed again, due to memory:

    # grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
     178
    -
    +
  • - +
  • I will increase the JVM heap size from 5120M to 6144M, though we don’t have much room left to grow as DSpace Test (linode19) is using a smaller instance size than CGSpace

  • + +
  • Gabriela from CIP asked if I could send her a list of all CIP authors so she can do some replacements on the name formats

  • + +
  • I got a list of all the CIP collections manually and used the same query that I used in August, 2017:

    dspace#= \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/89347', '10568/88229', '10568/53086', '10568/53085', '10568/69069', '10568/53087', '10568/53088', '10568/53089', '10568/53090', '10568/53091', '10568/53092', '10568/70150', '10568/53093', '10568/64874', '10568/53094'))) group by text_value order by count desc) to /tmp/cip-authors.csv with csv;
    -
    +
  • +

    2018-04-19

    + +
  • Also try deploying updated GeoLite database during ant update while re-deploying code:

    $ ant update update_geolite clean_backups
    -
    +
  • - +
  • I also re-deployed CGSpace (linode18) to make the ORCID search, authority cleanup, CCAFS project tag PII-LAM_CSAGender live

  • + +
  • When re-deploying I also updated the GeoLite databases so I hope the country stats become more accurate…

  • + +
  • After re-deployment I ran all system updates on the server and rebooted it

  • + +
  • After the reboot I forced a reïndexing of Discovery to populate the new ORCID index:

    $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
     real    73m42.635s
     user    8m15.885s
     sys     2m2.687s
    -
    +
  • -

    2018-04-20

    @@ -551,48 +556,40 @@ sys 2m2.687s + +
  • The DSpace logs show that there are no database connections:

    org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-715] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:18; idle:0; lastwait:5000].
    -
    +
  • - +
  • And there have been shit tons of errors in the DSpace log (starting only 20 minutes ago, luckily):

    # grep -c 'org.apache.tomcat.jdbc.pool.PoolExhaustedException' /home/cgspace.cgiar.org/log/dspace.log.2018-04-20
     32147
    -
    +
  • - +
  • I can’t even log into PostgreSQL as the postgres user, WTF?

    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c 
     ^C
    -
    +
  • - +
  • Here are the most active IPs today:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "20/Apr/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -    917 207.46.13.182
    -    935 213.55.99.121
    -    970 40.77.167.134
    -    978 207.46.13.80
    -   1422 66.249.64.155
    -   1577 50.116.102.77
    -   2456 95.108.181.88
    -   3216 104.196.152.243
    -   4325 70.32.83.92
    -  10718 45.5.184.2
    -
    +917 207.46.13.182
    +935 213.55.99.121
    +970 40.77.167.134
    +978 207.46.13.80
    +1422 66.249.64.155
    +1577 50.116.102.77
    +2456 95.108.181.88
    +3216 104.196.152.243
    +4325 70.32.83.92
    +10718 45.5.184.2
    +
  • - +
  • It doesn’t even seem like there is a lot of traffic compared to the previous days:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "20/Apr/2018" | wc -l
     74931
    @@ -600,43 +597,46 @@ sys     2m2.687s
     91073
     # zcat --force /var/log/nginx/*.log.2.gz /var/log/nginx/*.log.3.gz| grep -E "18/Apr/2018" | wc -l
     93459
    -
    +
  • - +
  • I tried to restart Tomcat but systemctl hangs

  • + +
  • I tried to reboot the server from the command line but after a few minutes it didn’t come back up

  • + +
  • Looking at the Linode console I see that it is stuck trying to shut down

  • + +
  • Even “Reboot” via Linode console doesn’t work!

  • + +
  • After shutting it down a few times via the Linode console it finally rebooted

  • + +
  • Everything is back but I have no idea what caused this—I suspect something with the hosting provider

  • + +
  • Also super weird, the last entry in the DSpace log file is from 2018-04-20 16:35:09, and then immediately it goes to 2018-04-20 19:15:04 (three hours later!):

    2018-04-20 16:35:09,144 ERROR org.dspace.app.util.AbstractDSpaceWebapp @ Failed to record shutdown in Webapp table.
     org.apache.tomcat.jdbc.pool.PoolExhaustedException: [localhost-startStop-2] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:18; idle
     :0; lastwait:5000].
    -        at org.apache.tomcat.jdbc.pool.ConnectionPool.borrowConnection(ConnectionPool.java:685)
    -        at org.apache.tomcat.jdbc.pool.ConnectionPool.getConnection(ConnectionPool.java:187)
    -        at org.apache.tomcat.jdbc.pool.DataSourceProxy.getConnection(DataSourceProxy.java:128)
    -        at org.dspace.storage.rdbms.DatabaseManager.getConnection(DatabaseManager.java:632)
    -        at org.dspace.core.Context.init(Context.java:121)
    -        at org.dspace.core.Context.<init>(Context.java:95)
    -        at org.dspace.app.util.AbstractDSpaceWebapp.deregister(AbstractDSpaceWebapp.java:97)
    -        at org.dspace.app.util.DSpaceContextListener.contextDestroyed(DSpaceContextListener.java:146)
    -        at org.apache.catalina.core.StandardContext.listenerStop(StandardContext.java:5115)
    -        at org.apache.catalina.core.StandardContext.stopInternal(StandardContext.java:5779)
    -        at org.apache.catalina.util.LifecycleBase.stop(LifecycleBase.java:224)
    -        at org.apache.catalina.core.ContainerBase$StopChild.call(ContainerBase.java:1588)
    -        at org.apache.catalina.core.ContainerBase$StopChild.call(ContainerBase.java:1577)
    -        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    -        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    -        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    -        at java.lang.Thread.run(Thread.java:748)
    +    at org.apache.tomcat.jdbc.pool.ConnectionPool.borrowConnection(ConnectionPool.java:685)
    +    at org.apache.tomcat.jdbc.pool.ConnectionPool.getConnection(ConnectionPool.java:187)
    +    at org.apache.tomcat.jdbc.pool.DataSourceProxy.getConnection(DataSourceProxy.java:128)
    +    at org.dspace.storage.rdbms.DatabaseManager.getConnection(DatabaseManager.java:632)
    +    at org.dspace.core.Context.init(Context.java:121)
    +    at org.dspace.core.Context.<init>(Context.java:95)
    +    at org.dspace.app.util.AbstractDSpaceWebapp.deregister(AbstractDSpaceWebapp.java:97)
    +    at org.dspace.app.util.DSpaceContextListener.contextDestroyed(DSpaceContextListener.java:146)
    +    at org.apache.catalina.core.StandardContext.listenerStop(StandardContext.java:5115)
    +    at org.apache.catalina.core.StandardContext.stopInternal(StandardContext.java:5779)
    +    at org.apache.catalina.util.LifecycleBase.stop(LifecycleBase.java:224)
    +    at org.apache.catalina.core.ContainerBase$StopChild.call(ContainerBase.java:1588)
    +    at org.apache.catalina.core.ContainerBase$StopChild.call(ContainerBase.java:1577)
    +    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    +    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    +    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    +    at java.lang.Thread.run(Thread.java:748)
     2018-04-20 19:15:04,006 INFO  org.dspace.core.ConfigurationManager @ Loading from classloader: file:/home/cgspace.cgiar.org/config/dspace.cfg
    -
    +
  • -

    2018-04-24

    @@ -660,34 +660,32 @@ org.apache.tomcat.jdbc.pool.PoolExhaustedException: [localhost-startStop-2] Time + +
  • So for my notes: when I’m importing a CGSpace database dump I need to grant superuser permission to the user, rather than creating the user:

    $ psql dspacetest -c 'alter user dspacetest superuser;'
     $ pg_restore -O -U dspacetest -d dspacetest -W -h localhost /tmp/dspace_2018-04-18.backup
    -
    +
  • - +
  • There’s another issue with Tomcat in Ubuntu 18.04:

    25-Apr-2018 13:26:21.493 SEVERE [http-nio-127.0.0.1-8443-exec-1] org.apache.coyote.AbstractProtocol$ConnectionHandler.process Error reading request, ignored
    - java.lang.NoSuchMethodError: java.nio.ByteBuffer.position(I)Ljava/nio/ByteBuffer;
    -        at org.apache.coyote.http11.Http11InputBuffer.init(Http11InputBuffer.java:688)
    -        at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:672)
    -        at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66)
    -        at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:790)
    -        at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1459)
    -        at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
    -        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    -        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    -        at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    -        at java.lang.Thread.run(Thread.java:748)
    -
    +java.lang.NoSuchMethodError: java.nio.ByteBuffer.position(I)Ljava/nio/ByteBuffer;
    + at org.apache.coyote.http11.Http11InputBuffer.init(Http11InputBuffer.java:688)
    + at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:672)
    + at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66)
    + at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:790)
    + at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1459)
    + at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
    + at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    + at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    + at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    + at java.lang.Thread.run(Thread.java:748)
    +
  • -

    2018-04-29

    diff --git a/docs/2018-05/index.html b/docs/2018-05/index.html index cd3d5da9e..51b6d4ecc 100644 --- a/docs/2018-05/index.html +++ b/docs/2018-05/index.html @@ -37,7 +37,7 @@ http://localhost:3000/solr/statistics/update?stream.body=%3Ccommit/%3E Then I reduced the JVM heap size from 6144 back to 5120m Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked the Ansible infrastructure scripts to support hosts choosing which distribution they want to use "/> - + @@ -164,72 +164,75 @@ Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked + +
  • I export them and include the hidden metadata fields like dc.date.accessioned so I can filter the ones from 2018-04 and correct them in OpenRefine:

    $ dspace metadata-export -a -f /tmp/iita.csv -i 10568/68616
    -
    +
  • -

    2018-05-06

    + +
  • I corrected all the DOIs and then checked them for validity with a quick bash loop:

    $ for line in $(< /tmp/links.txt); do echo $line; http --print h $line; done
    -
    +
  • - + +
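  • A curl variation prints just the status code next to each URL, which is a bit easier to scan (a sketch; it assumes the URLs are one per line in /tmp/links.txt as above):

    $ while read -r line; do echo -n "$line "; curl -s -o /dev/null -w '%{http_code}\n' "$line"; done < /tmp/links.txt
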
  • Fixed some issues in regions, countries, sponsors, ISSN, and cleaned whitespace errors from citation, abstract, author, and titles

  • + +
  • Fixed all issues with CRPs

  • + +
  • A few more interesting Unicode characters to look for in text fields like author, abstracts, and citations might be: ’ (0x2019), · (0x00b7), and € (0x20ac)

  • + +
  • A custom text facet in OpenRefine with this GREL expression could be good for finding invalid characters or encoding errors in authors, abstracts, etc:

    or(
    -  isNotNull(value.match(/.*[(|)].*/)),
    -  isNotNull(value.match(/.*\uFFFD.*/)),
    -  isNotNull(value.match(/.*\u00A0.*/)),
    -  isNotNull(value.match(/.*\u200A.*/)),
    -  isNotNull(value.match(/.*\u2019.*/)),
    -  isNotNull(value.match(/.*\u00b7.*/)),
    -  isNotNull(value.match(/.*\u20ac.*/))
    +isNotNull(value.match(/.*[(|)].*/)),
    +isNotNull(value.match(/.*\uFFFD.*/)),
    +isNotNull(value.match(/.*\u00A0.*/)),
    +isNotNull(value.match(/.*\u200A.*/)),
    +isNotNull(value.match(/.*\u2019.*/)),
    +isNotNull(value.match(/.*\u00b7.*/)),
    +isNotNull(value.match(/.*\u20ac.*/))
     )
    -
    +
  • - +
  • I found some more IITA records that Sisay imported on 2018-03-23 that have invalid CRP names, so now I kinda want to check those ones!

  • + +
  • Combine the ORCID identifiers Abenet sent with our existing list and resolve their names using the resolve-orcids.py script:

    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/ilri-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2018-05-06-combined.txt
     $ ./resolve-orcids.py -i /tmp/2018-05-06-combined.txt -o /tmp/2018-05-06-combined-names.txt -d
     # sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
     $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
    -
    +
  • -

    2018-05-07

    @@ -249,16 +252,14 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
  • I told him that there were still some TODO items for him on that data, for example to update the dc.language.iso field for the Spanish items
  • I was trying to remember how I parsed the input-forms.xml using xmllint to extract subjects neatly
  • I could use it with reconcile-csv or to populate a Solr instance for reconciliation
  • -
  • This XPath expression gets close, but outputs all items on one line:
  • - + +
  • This XPath expression gets close, but outputs all items on one line:

    $ xmllint --xpath '//value-pairs[@value-pairs-name="crpsubject"]/pair/stored-value/node()' dspace/config/input-forms.xml        
     Agriculture for Nutrition and HealthBig DataClimate Change, Agriculture and Food SecurityExcellence in BreedingFishForests, Trees and AgroforestryGenebanksGrain Legumes and Dryland CerealsLivestockMaizePolicies, Institutions and MarketsRiceRoots, Tubers and BananasWater, Land and EcosystemsWheatAquatic Agricultural SystemsDryland CerealsDryland SystemsGrain LegumesIntegrated Systems for the Humid TropicsLivestock and Fish
    -
    +
  • - +
  • Maybe xmlstarlet is better:

    $ xmlstarlet sel -t -v '//value-pairs[@value-pairs-name="crpsubject"]/pair/stored-value/text()' dspace/config/input-forms.xml
     Agriculture for Nutrition and Health
    @@ -282,20 +283,20 @@ Dryland Systems
     Grain Legumes
     Integrated Systems for the Humid Tropics
     Livestock and Fish
    -
    +
  • - +
  • Discuss Colombian BNARS harvesting the CIAT data from CGSpace

  • + +
  • They are using a system called Primo and the only options for data harvesting in that system are via FTP and OAI

  • + +
  • I told them to get all CIAT records via OAI

  • + +
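  • The harvest request would look something like this (a sketch; the set spec for the CIAT community is hypothetical, they would need to find the real one with the ListSets verb):

    $ http 'https://cgspace.cgiar.org/oai/request?verb=ListRecords&metadataPrefix=oai_dc&set=com_10568_XXXXX'
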
  • Just a note to myself, I figured out how to get reconcile-csv to run from source rather than running the old pre-compiled JAR file:

    $ lein run /tmp/crps.csv name id
    -
    +
  • -

    2018-05-13

    @@ -329,83 +330,85 @@ Livestock and Fish + +
  • This will fetch a URL and return its HTTP response code:

    import urllib2
     import re
     
     pattern = re.compile('.*10.1016.*')
     if pattern.match(value):
    -  get = urllib2.urlopen(value)
    -  return get.getcode()
    +get = urllib2.urlopen(value)
    +return get.getcode()
     
     return "blank"
    -
    +
  • - +
  • I used a regex to limit it to just some of the DOIs in this case because there were thousands of URLs

  • + +
  • Here the response code would be 200, 404, etc, or “blank” if there is no URL for that item

  • + +
  • You could use this in a facet or in a new column

  • + +
  • More information and good examples here: https://programminghistorian.org/lessons/fetch-and-parse-data-with-openrefine

  • + +
  • Finish looking at the 2,640 CIFOR records on DSpace Test (10568/92904), cleaning up authors and adding collection mappings

  • + +
  • They can now be moved to CGSpace as far as I’m concerned, but I don’t know if Sisay will do it or me

  • + +
  • I was checking the CIFOR data for duplicates using Atmire’s Metadata Quality Module (and found some duplicates actually), but then DSpace died…

  • + +
  • I didn’t see anything in the Tomcat, DSpace, or Solr logs, but I saw this in dmesg -T:

    [Tue May 15 12:10:01 2018] Out of memory: Kill process 3763 (java) score 706 or sacrifice child
     [Tue May 15 12:10:01 2018] Killed process 3763 (java) total-vm:14667688kB, anon-rss:5705268kB, file-rss:0kB, shmem-rss:0kB
     [Tue May 15 12:10:01 2018] oom_reaper: reaped process 3763 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    -
    +
  • - +
  • So the Linux kernel killed Java…

  • + +
  • Maria from Bioversity mailed to say she got an error while submitting an item on CGSpace:

    Unable to load Submission Information, since WorkspaceID (ID:S96060) is not a valid in-process submission
    -
    +
  • - +
  • Looking in the DSpace log I see something related:

    2018-05-15 12:35:30,858 INFO  org.dspace.submit.step.CompleteStep @ m.garruccio@cgiar.org:session_id=8AC4499945F38B45EF7A1226E3042DAE:submission_complete:Completed submission with id=96060
    -
    +
  • - +
  • So I’m not sure…

  • + +
  • I finally figured out how to get OpenRefine to reconcile values from Solr via conciliator:

  • + +
  • The trick was to use a more appropriate Solr fieldType text_en instead of text_general so that more terms match, for example uppercase and lower case:

    $ ./bin/solr start
     $ ./bin/solr create_core -c countries
     $ curl -X POST -H 'Content-type:application/json' --data-binary '{"add-field": {"name":"country", "type":"text_en", "multiValued":false, "stored":true}}' http://localhost:8983/solr/countries/schema
     $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
    -
    +
  • -

    OpenRefine reconciling countries from local Solr

    +
  • I should probably make a general copy field and set it to be the default search field, like DSpace’s search core does (see schema.xml):

    <defaultSearchField>search_text</defaultSearchField>
     ...
     <copyField source="*" dest="search_text"/>
    -
    +
  • -

    2018-05-16

    @@ -422,18 +425,19 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
  • Silvia asked if I could sort the records in her Listings and Report output and it turns out that the options are misconfigured in dspace/config/modules/atmire-listings-and-reports.cfg
  • I created and merged a pull request to fix the sorting issue in Listings and Reports (#374)
  • -
  • Regarding the IP Address Anonymization for GDPR, I amended the Google Analytics snippet in page-structure-alterations.xsl to:
  • - + +
  • Regarding the IP Address Anonymization for GDPR, I amended the Google Analytics snippet in page-structure-alterations.xsl to:

    ga('send', 'pageview', {
    -  'anonymizeIp': true
    +'anonymizeIp': true
     });
    -
    +
  • -

    2018-05-17

    @@ -495,18 +499,20 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv

    2018-05-23

    +
  • I’m investigating how many non-CGIAR users we have registered on CGSpace:

    dspace=# select email, netid from eperson where email not like '%cgiar.org%' and email like '%@%';
    -
    +
  • -
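  • To get just the number, the same WHERE clause works with a count (a sketch):

    $ psql dspace -c "select count(*) from eperson where email not like '%cgiar.org%' and email like '%@%';"
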

    2018-05-28

    @@ -523,54 +529,60 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv + +
  • I see this in dmesg:

    [Wed May 30 00:00:39 2018] Out of memory: Kill process 6082 (java) score 697 or sacrifice child
     [Wed May 30 00:00:39 2018] Killed process 6082 (java) total-vm:14876264kB, anon-rss:5683372kB, file-rss:0kB, shmem-rss:0kB
     [Wed May 30 00:00:40 2018] oom_reaper: reaped process 6082 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    -
    +
  • - +
  • I need to check the Tomcat JVM heap size/usage, command line JVM heap size (for cron jobs), and PostgreSQL memory usage

  • + +
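  • A quick way to see what the running JVMs are actually configured with (a sketch):

    $ ps aux | grep -oE '\-Xmx[0-9]+[mg]' | sort | uniq -c
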
  • It might be possible to adjust some things, but eventually we’ll need a larger VPS instance

  • + +
  • For some reason there are no JVM stats in Munin, ugh

  • + +
  • Run all system updates on DSpace Test and reboot it

  • + +
  • I generated a list of CIFOR duplicates from the CIFOR_May_9 collection using the Atmire MQM module and then dumped the HTML source so I could process it for sending to Vika

  • + +
  • I used grep to filter all relevant handle lines from the HTML source then used sed to insert a newline before each “Item1” line (as the duplicates are grouped like Item1, Item2, Item3 for each set of duplicates):

    $ grep -E 'aspect.duplicatechecker.DuplicateResults.field.del_handle_[0-9]{1,3}_Item' ~/Desktop/https\ _dspacetest.cgiar.org_atmire_metadata-quality_duplicate-checker.html > ~/cifor-duplicates.txt
     $ sed 's/.*Item1.*/\n&/g' ~/cifor-duplicates.txt > ~/cifor-duplicates-cleaned.txt
    -
    +
  • - +
  • I told Vika to look through the list manually and indicate which ones are indeed duplicates that we should delete, and which ones to map to CIFOR’s collection

  • + +
  • A few weeks ago Peter wanted a list of authors from the ILRI collections, so I need to find a way to get the handles of all those collections

  • + +
  • I can use the /communities/{id}/collections endpoint of the REST API but it only takes IDs (not handles) and doesn’t seem to descend into sub communities

  • + +
  • Shit, so I need the IDs for the top-level ILRI community and all its sub communities (and their sub communities)

  • + +
  • There has got to be a better way to do this than going to each community and getting their handles and IDs manually

  • + +
  • Oh shit, I literally already wrote a script to get all collections in a community hierarchy from the REST API: rest-find-collections.py

  • + +
  • The output isn’t great, but all the handles and IDs are printed in debug mode:

    $ ./rest-find-collections.py -u https://cgspace.cgiar.org/rest -d 10568/1 2> /tmp/ilri-collections.txt
    -
    +
  • - +
  • Then I format the list of handles and put it into this SQL query to export authors from items ONLY in those collections (too many to list here):

    dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/67236','10568/67274',...))) group by text_value order by count desc) to /tmp/ilri-authors.csv with csv;
    -
    +
  • +

    2018-05-31

    + +
  • Now I can just use Docker:

    $ docker pull postgres:9.5-alpine
     $ docker run --name dspacedb -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.5-alpine
    @@ -581,7 +593,8 @@ $ pg_restore -h localhost -O -U dspacetest -d dspacetest -W -h localhost ~/Downl
     $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
     $ psql -h localhost -U dspacetest -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
     $ psql -h localhost -U postgres dspacetest
    -
    +
  • + diff --git a/docs/2018-06/index.html b/docs/2018-06/index.html index f7677a31d..1eca381c3 100644 --- a/docs/2018-06/index.html +++ b/docs/2018-06/index.html @@ -15,22 +15,22 @@ Test the DSpace 5.8 module upgrades from Atmire (#378) There seems to be a problem with the CUA and L&R versions in pom.xml because they are using SNAPSHOT and it doesn’t build I added the new CCAFS Phase II Project Tag PII-FP1_PACCA2 and merged it into the 5_x-prod branch (#379) -I proofed and tested the ILRI author corrections that Peter sent back to me this week: +I proofed and tested the ILRI author corrections that Peter sent back to me this week: $ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n - I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in March, 2018 -Time to index ~70,000 items on CGSpace: +Time to index ~70,000 items on CGSpace: $ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b real 74m42.646s user 8m5.056s sys 2m7.289s + " /> @@ -48,24 +48,24 @@ Test the DSpace 5.8 module upgrades from Atmire (#378) There seems to be a problem with the CUA and L&R versions in pom.xml because they are using SNAPSHOT and it doesn’t build I added the new CCAFS Phase II Project Tag PII-FP1_PACCA2 and merged it into the 5_x-prod branch (#379) -I proofed and tested the ILRI author corrections that Peter sent back to me this week: +I proofed and tested the ILRI author corrections that Peter sent back to me this week: $ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n - I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in March, 2018 -Time to index ~70,000 items on CGSpace: +Time to index ~70,000 items on CGSpace: $ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b real 74m42.646s user 8m5.056s sys 2m7.289s + "/> - + @@ -153,23 +153,23 @@ sys 2m7.289s
  • There seems to be a problem with the CUA and L&R versions in pom.xml because they are using SNAPSHOT and it doesn’t build
  • I added the new CCAFS Phase II Project Tag PII-FP1_PACCA2 and merged it into the 5_x-prod branch (#379)
  • -
  • I proofed and tested the ILRI author corrections that Peter sent back to me this week:
  • - + +
  • I proofed and tested the ILRI author corrections that Peter sent back to me this week:

    $ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
    -
    +
  • - +
  • I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in March, 2018

  • + +
  • Time to index ~70,000 items on CGSpace:

    $ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b                                  
     
     real    74m42.646s
     user    8m5.056s
     sys     2m7.289s
    -
    +
  • +

    2018-06-06

    @@ -198,32 +198,29 @@ sys 2m7.289s
  • Universit F lix Houphouet-Boigny
  • I uploaded fixes for all those now, but I will continue with the rest of the data later
  • -
  • Regarding the SQL migration errors, Atmire told me I need to run some migrations manually in PostgreSQL:
  • - + +
  • Regarding the SQL migration errors, Atmire told me I need to run some migrations manually in PostgreSQL:

    delete from schema_version where version = '5.6.2015.12.03.2';
     update schema_version set version = '5.6.2015.12.03.2' where version = '5.5.2015.12.03.2';
     update schema_version set version = '5.8.2015.12.03.3' where version = '5.5.2015.12.03.3';
    -
    +
  • - +
  • And then I need to ignore the ignored ones:

    $ ~/dspace/bin/dspace database migrate ignored
    -
    +
  • - +
  • Now DSpace starts up properly!

  • + +
  • Gabriela from CIP got back to me about the author names we were correcting on CGSpace

  • + +
  • I did a quick sanity check on them and then did a test import with my fix-metadata-value.py script:

    $ ./fix-metadata-values.py -i /tmp/2018-06-08-CIP-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
    -
    +
  • -

    2018-06-09

    @@ -238,17 +235,18 @@ update schema_version set version = '5.8.2015.12.03.3' where version = '5.5.2015 -
     INFO [org.dspace.servicemanager.DSpaceServiceManager] Shutdown DSpace core service manager
    +
  • After removing all code mentioning MQM, mqm, metadata-quality, batchedit, duplicatechecker, etc, I think I got most of it, but there is still a Spring error during Tomcat startup:

    + +
    INFO [org.dspace.servicemanager.DSpaceServiceManager] Shutdown DSpace core service manager
     Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'org.dspace.servicemanager.spring.DSpaceBeanPostProcessor#0' defined in class path resource [spring/spring-dspace-applicationContext.xml]: Unsatisfied dependency expressed through constructor argument with index 0 of type [org.dspace.servicemanager.config.DSpaceConfigurationService]: : Cannot find class [com.atmire.dspace.discovery.ItemCollectionPlugin] for bean with name 'itemCollectionPlugin' defined in file [/home/aorth/dspace/config/spring/api/discovery.xml];
    -
    +
  • -
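  • Grepping the Spring config for the leftover bean named in the error should show what still needs to be removed (a sketch based on the file path in the error):

    $ grep -r 'ItemCollectionPlugin' ~/dspace/config/spring/
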

    2018-06-11

    @@ -301,7 +300,8 @@ Failed to startup the DSpace Service Manager: failure starting up spring service
  • The style sheet obfuscates the data, but if you look at the source it is all there, including information about pagination of results
  • Regarding Udana’s Book Chapters and Reports on DSpace Test last week, Abenet told him to fix some character encoding and CRP issues, then I told him I’d check them after that
  • The latest batch of IITA’s 200 records (based on Abenet’s version Mercy1805_AY.xls) are now in the IITA_Jan_9_II_Ab collection
  • -
  • So here are some corrections: + +
  • So here are some corrections:

  • - + +
  • Based on manually eyeballing the text I used a custom text facet with this GREL to identify the records:

    or(
    -  value.contains('€'),
    -  value.contains('6g'),
    -  value.contains('6m'),
    -  value.contains('6d'),
    -  value.contains('6e')
    +value.contains('€'),
    +value.contains('6g'),
    +value.contains('6m'),
    +value.contains('6d'),
    +value.contains('6e')
     )
    -
    +
  • - @@ -366,38 +366,33 @@ Failed to startup the DSpace Service Manager: failure starting up spring service + +
  • I used my add-orcid-identifiers-csv.py script:

    $ ./add-orcid-identifiers-csv.py -i 2018-06-13-Robin-Buruchara.csv -db dspace -u dspace -p 'fuuu'
    -
    +
  • - +
  • The contents of 2018-06-13-Robin-Buruchara.csv were:

    dc.contributor.author,cg.creator.id
     "Buruchara, Robin",Robin Buruchara: 0000-0003-0934-1218
     "Buruchara, Robin A.",Robin Buruchara: 0000-0003-0934-1218
    -
    +
  • - +
  • On a hunch I checked to see if CGSpace’s bitstream cleanup was working properly and of course it’s broken:

    $ dspace cleanup -v
     ...
     Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
    -  Detail: Key (bitstream_id)=(152402) is still referenced from table "bundle".
    -
    +Detail: Key (bitstream_id)=(152402) is still referenced from table "bundle".
    +
  • - +
  • As always, the solution is to delete that ID manually in PostgreSQL:

    $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (152402);'
     UPDATE 1
    -
    +
  • +

    2018-06-14

    @@ -411,39 +406,47 @@ UPDATE 1

    2018-06-24

    +
  • I was restoring a PostgreSQL dump on my test machine and found a way to restore the CGSpace dump as the postgres user, but have the owner of the schema be the dspacetest user:

    $ dropdb -h localhost -U postgres dspacetest
     $ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
     $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
     $ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost /tmp/cgspace_2018-06-24.backup
     $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
    -
    +
  • - +
  • The -O option to pg_restore makes the import process ignore ownership specified in the dump itself, and instead makes the schema owned by the user doing the restore

  • + +
  • I always prefer to use the postgres user locally because it’s just easier than remembering the dspacetest user’s password, but then I couldn’t figure out why the resulting schema was owned by postgres

  • + +
  • So with this you connect as the postgres superuser and then switch roles to dspacetest (also, make sure this user has superuser privileges before the restore)

  • + +
  • Last week Linode emailed me to say that our Linode 8192 instance used for DSpace Test qualified for an upgrade

  • + +
  • Apparently they announced some upgrades to most of their plans in 2018-05

  • + +
  • After the upgrade I see we have more disk space available in the instance’s dashboard, so I shut the instance down and resized it from 98GB to 160GB

  • + +
  • The resize was very quick (less than one minute) and after booting the instance back up I now have 160GB for the root filesystem!

  • + +
  • I will move the DSpace installation directory back to the root file system and delete the extra 300GB block storage volume; it was actually kinda slow when we put Solr there, and we don’t need it anymore because running the production Solr on this instance didn’t work well with 8GB of RAM

  • + +
  • Also, the larger instance we’re using for CGSpace will go from 24GB of RAM to 32, and will also get a storage increase from 320GB to 640GB… that means we don’t need to consider using block storage right now!

  • + +
  • The smaller instances get increased storage and network speed but I doubt many are actually using much of their current allocations so we probably don’t need to bother with upgrading them

  • + +
  • Last week Abenet asked if we could add dc.language.iso to the advanced search filters

  • + +
  • There is already a search filter for this field defined in discovery.xml but we aren’t using it, so I quickly enabled and tested it, then merged it to the 5_x-prod branch (#380)

  • + +
  • Back to testing the DSpace 5.8 changes from Atmire, I had another issue with SQL migrations:

    Caused by: org.flywaydb.core.api.FlywayException: Validate failed. Found differences between applied migrations and available migrations: Detected applied migration missing on the classpath: 5.8.2015.12.03.3
    -
    +
  • - +
  • It took me a while to figure out that this migration is for MQM, which I removed after Atmire’s original advice about the migrations, so we actually need to delete this migration instead of updating it

  • + +
  • So I need to make sure to run the following during the DSpace 5.8 upgrade:

    -- Delete existing CUA 4 migration if it exists
     delete from schema_version where version = '5.6.2015.12.03.2';
    @@ -453,49 +456,45 @@ update schema_version set version = '5.6.2015.12.03.2' where version = '5.5.2015
     
     -- Delete MQM migration since we're no longer using it
     delete from schema_version where version = '5.5.2015.12.03.3';
    -
    +
  • - +
  • After that you can run the migrations manually and then DSpace should work fine:

    $ ~/dspace/bin/dspace database migrate ignored
     ...
     Done.
    -
    +
  • - +
  • Elizabeth from CIAT contacted me to ask if I could add ORCID identifiers to all of Andy Jarvis’ items on CGSpace

  • + +
  • I used my add-orcid-identifiers-csv.py script:

    $ ./add-orcid-identifiers-csv.py -i 2018-06-24-andy-jarvis-orcid.csv -db dspacetest -u dspacetest -p 'fuuu'
    -
    +
  • - +
  • The contents of 2018-06-24-andy-jarvis-orcid.csv were:

    dc.contributor.author,cg.creator.id
     "Jarvis, A.",Andy Jarvis: 0000-0001-6543-0798
     "Jarvis, Andy",Andy Jarvis: 0000-0001-6543-0798
     "Jarvis, Andrew",Andy Jarvis: 0000-0001-6543-0798
    -
    +
  • +

    2018-06-26

    + +
  • This warning appears in the DSpace log:

    2018-06-26 16:58:12,052 WARN  org.dspace.xoai.services.impl.xoai.DSpaceRepositoryConfiguration @ { OAI 2.0 :: DSpace } Not able to retrieve the dspace.oai.url property from oai.cfg. Falling back to request address
    -
    +
  • -
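  • Presumably setting the property it mentions in oai.cfg would silence it, perhaps something like this (a guess based on the warning text; the path and value are assumptions):

    $ echo 'dspace.oai.url = https://cgspace.cgiar.org/oai' >> dspace/config/modules/oai.cfg
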

    2018-06-27

    @@ -503,8 +502,8 @@ Done. + +
  • First, get the 62 deletes from Vika’s file and remove them from the collection:

    $ grep delete 2018-06-22-cifor-duplicates.txt | grep -o -E '[0-9]{5}\/[0-9]{5}' > cifor-handle-to-delete.txt
     $ wc -l cifor-handle-to-delete.txt
    @@ -514,51 +513,53 @@ $ wc -l 10568-92904.csv
     $ while read line; do sed -i "\#$line#d" 10568-92904.csv; done < cifor-handle-to-delete.txt
     $ wc -l 10568-92904.csv
     2399 10568-92904.csv
    -
    +
  • - +
  • This iterates over the handles for deletion and uses sed with an alternative pattern delimiter of ‘#’ (which must be escaped), because the pattern itself contains a ‘/’

  • + +
  • The mapped ones will be difficult because we need their internal IDs in order to map them, and there are 50 of them:

    $ grep map 2018-06-22-cifor-duplicates.txt | grep -o -E '[0-9]{5}\/[0-9]{5}' > cifor-handle-to-map.txt
     $ wc -l cifor-handle-to-map.txt
     50 cifor-handle-to-map.txt
    -
    +
  • - +
  • I can either get them from the database, or programmatically export the metadata using dspace metadata-export -i 10568/xxxxx

  • + +
  • Oooh, I can export the items one by one, concatenate them together, remove the headers, and extract the id and collection columns using csvkit:

    $ while read line; do filename=${line/\//-}.csv; dspace metadata-export -i $line -f $filename; done < /tmp/cifor-handle-to-map.txt
     $ sed '/^id/d' 10568-*.csv | csvcut -c 1,2 > map-to-cifor-archive.csv
    -
    +
  • -

    2018-06-28

    + +
  • There is nothing in the Tomcat or DSpace logs, but I see the following in dmesg -T:

    [Thu Jun 28 00:00:30 2018] Out of memory: Kill process 14501 (java) score 701 or sacrifice child
     [Thu Jun 28 00:00:30 2018] Killed process 14501 (java) total-vm:14926704kB, anon-rss:5693608kB, file-rss:0kB, shmem-rss:0kB
     [Thu Jun 28 00:00:30 2018] oom_reaper: reaped process 14501 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    -
    +
  • - diff --git a/docs/2018-07/index.html b/docs/2018-07/index.html index 1ca7e4fe4..715a9fdf4 100644 --- a/docs/2018-07/index.html +++ b/docs/2018-07/index.html @@ -11,15 +11,13 @@ I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case: - $ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace - During the mvn package stage on the 5.8 branch I kept getting issues with java running out of memory: - There is insufficient memory for the Java Runtime Environment to continue. + " /> @@ -33,17 +31,15 @@ There is insufficient memory for the Java Runtime Environment to continue. I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case: - $ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace - During the mvn package stage on the 5.8 branch I kept getting issues with java running out of memory: - There is insufficient memory for the Java Runtime Environment to continue. + "/> - + @@ -125,30 +121,25 @@ There is insufficient memory for the Java Runtime Environment to continue.

    2018-07-01

    +
  • I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:

    $ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
    -
    +
  • - +
  • During the mvn package stage on the 5.8 branch I kept getting issues with java running out of memory:

    There is insufficient memory for the Java Runtime Environment to continue.
    -
    +
  • + +
  • As the machine only has 8GB of RAM, I reduced the Tomcat memory heap from 5120m to 4096m so I could try to allocate more to the build process:

    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
     $ mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false -Denv=dspacetest.cgiar.org -P \!dspace-lni,\!dspace-rdf,\!dspace-sword,\!dspace-swordv2 clean package
    -
    +
  • - +
  • Then I stopped the Tomcat 7 service, ran the ant update, and manually ran the old and ignored SQL migrations:

    $ sudo su - postgres
     $ psql dspace
    @@ -163,10 +154,9 @@ dspace=# commit
     dspace=# \q
     $ exit
     $ dspace database migrate ignored
    -
    +
  • -

    2018-07-02

    @@ -179,38 +169,34 @@ $ dspace database migrate ignored

    2018-07-03

    + +
  • Then I imported those 2398 items in two batches (to deal with memory issues):

    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
     $ dspace metadata-import -e aorth@mjanja.ch -f /tmp/2018-06-27-New-CIFOR-Archive.csv
     $ dspace metadata-import -e aorth@mjanja.ch -f /tmp/2018-06-27-New-CIFOR-Archive2.csv
    -
    +
  • + - +
  • I noticed there are many items that use HTTP instead of HTTPS for their Google Books URL, and some missing HTTP entirely:

    dspace=# select count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=222 and text_value like 'http://books.google.%';
    - count
    +count
     -------
    -   785
    +785
     dspace=# select count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=222 and text_value ~ '^books\.google\..*';
    - count
    +count
     -------
    -     4
    -
     + 4
     +
  • - +
  • I think I should fix that as well as some other garbage values like “test” and “dspace.ilri.org” etc:

    dspace=# begin;
     dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://books.google', 'https://books.google') where resource_type_id=2 and metadata_field_id=222 and text_value like 'http://books.google.%';
    @@ -222,14 +208,12 @@ UPDATE 1
     dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=222 and metadata_value_id in (2299312, 10684, 10700, 996403);
     DELETE 4
     dspace=# commit;
    -
    +
  • - +
  • Testing DSpace 5.8 with PostgreSQL 9.6 and Tomcat 8.5.32 (instead of my usual 7.0.88) and for some reason I get autowire errors on Catalina startup with 8.5.32:

    03-Jul-2018 19:51:37.272 SEVERE [localhost-startStop-1] org.apache.catalina.core.StandardContext.listenerStart Exception sending context initialized event to listener instance of class [org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener]
    - java.lang.RuntimeException: Failure during filter init: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/aorth/dspace/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations: {@org.springframework.beans.factory.annotation.Autowired(required=true)}
    +java.lang.RuntimeException: Failure during filter init: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/aorth/dspace/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations: {@org.springframework.beans.factory.annotation.Autowired(required=true)}
     	at org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener.contextInitialized(DSpaceKernelServletContextListener.java:92)
     	at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4792)
     	at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5256)
    @@ -245,10 +229,9 @@ dspace=# commit;
     	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
     	at java.lang.Thread.run(Thread.java:748)
     Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/aorth/dspace/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations: {@org.springframework.beans.factory.annotation.Autowired(required=true)}
    -
    +
  • -

    2018-07-04

    @@ -274,92 +257,96 @@ Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Mana
  • I was tempted to do the Linode instance upgrade on CGSpace (linode18), but after looking closely at the system backups I noticed that Solr isn’t being backed up to S3
  • I apparently noticed this—and fixed it!—in 2016-07, but it doesn’t look like the backup has been updated since then!
  • It looks like I added Solr to the backup_to_s3.sh script, but that script is not even being used (s3cmd is run directly from root’s crontab)
  • -
  • For now I have just initiated a manual S3 backup of the Solr data:
  • - + +
  • For now I have just initiated a manual S3 backup of the Solr data:

    # s3cmd sync --delete-removed /home/backup/solr/ s3://cgspace.cgiar.org/solr/
    -
    +
  • - +
  • But I need to add this to cron!
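
  • A minimal sketch of what that cron entry could look like (the schedule and the s3cmd path are assumptions, not the real crontab):

    # m h dom mon dow command
    30 3 * * * /usr/bin/s3cmd sync --delete-removed /home/backup/solr/ s3://cgspace.cgiar.org/solr/ > /dev/null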

  • + +
  • I wonder if I should convert some of the cron jobs to systemd services / timers…

  • + +
  • I sent a note to all our users on Yammer to ask them about possible maintenance on Sunday, July 14th

  • + +
  • Abenet wants to be able to search by journal title (dc.source) in the advanced Discovery search so I opened an issue for it (#384)

  • + +
  • I regenerated the list of names for all our ORCID iDs using my resolve-orcids.py script:

    $ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml | sort | uniq > /tmp/2018-07-08-orcids.txt
     $ ./resolve-orcids.py -i /tmp/2018-07-08-orcids.txt -o /tmp/2018-07-08-names.txt -d
    -
    +
  • -

    2018-07-09

    + +
  • Uptime Robot said that CGSpace was down for two minutes again later in the day, and this time I saw a memory error in Tomcat’s catalina.out:

    Exception in thread "http-bio-127.0.0.1-8081-exec-557" java.lang.OutOfMemoryError: Java heap space
    -
    +
  • - +
  • I’m not sure if it’s the same error, but I see this in DSpace’s solr.log:

    2018-07-09 06:25:09,913 ERROR org.apache.solr.servlet.SolrDispatchFilter @ null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
    -
    +
  • - +
  • I see a strange error around that time in dspace.log.2018-07-08:

    2018-07-09 06:23:43,510 ERROR com.atmire.statistics.SolrLogThread @ IOException occured when talking to server at: http://localhost:8081/solr/statistics
     org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://localhost:8081/solr/statistics
    -
    +
  • - +
  • But not sure what caused that…

  • + +
  • I got a message from Linode tonight that CPU usage was high on CGSpace for the past few hours around 8PM GMT

  • + +
  • Looking in the nginx logs I see the top ten IP addresses active today:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "09/Jul/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -   1691 40.77.167.84
    -   1701 40.77.167.69
    -   1718 50.116.102.77
    -   1872 137.108.70.6
    -   2172 157.55.39.234
    -   2190 207.46.13.47
    -   2848 178.154.200.38
    -   4367 35.227.26.162
    -   4387 70.32.83.92
    -   4738 95.108.181.88
    -
     +1691 40.77.167.84
     +1701 40.77.167.69
     +1718 50.116.102.77
     +1872 137.108.70.6
     +2172 157.55.39.234
     +2190 207.46.13.47
     +2848 178.154.200.38
     +4367 35.227.26.162
     +4387 70.32.83.92
     +4738 95.108.181.88
     +
  • - +
  • Of those, all except 70.32.83.92 and 50.116.102.77 are NOT re-using their Tomcat sessions, for example from the XMLUI logs:

    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2018-07-09
     4435
    -
    +
  • -
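
  • To check all ten at once, something like this works (a sketch; it assumes the IPs have been saved to /tmp/top-ips.txt):

    $ while read -r ip; do echo -n "$ip: "; grep -c -E "session_id=[A-Z0-9]{32}:ip_addr=$ip" dspace.log.2018-07-09; done < /tmp/top-ips.txt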

    2018-07-10

    @@ -372,32 +359,30 @@ org.apache.solr.client.solrj.SolrServerException: IOException occured when talki
  • All were tested and merged to the 5_x-prod branch and will be deployed on CGSpace this coming weekend when I do the Linode server upgrade
  • I need to get them onto the 5.8 testing branch too, either via cherry-picking or by rebasing after we finish testing Atmire’s 5.8 pull request (#378)
  • Linode sent an alert about CPU usage on CGSpace again, about 13:00UTC
  • -
  • These are the top ten users in the last two hours:
  • - + +
  • These are the top ten users in the last two hours:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "10/Jul/2018:(11|12|13)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -     81 193.95.22.113
    -     82 50.116.102.77
    -    112 40.77.167.90
    -    117 196.190.95.98
    -    120 178.154.200.38
    -    215 40.77.167.96
    -    243 41.204.190.40
    -    415 95.108.181.88
    -    695 35.227.26.162
    -    697 213.139.52.250
    -
     + 81 193.95.22.113
     + 82 50.116.102.77
     +112 40.77.167.90
     +117 196.190.95.98
     +120 178.154.200.38
     +215 40.77.167.96
     +243 41.204.190.40
     +415 95.108.181.88
     +695 35.227.26.162
     +697 213.139.52.250
     +
  • - +
  • Looks like 213.139.52.250 is Moayad testing his new CGSpace visualization thing:

    213.139.52.250 - - [10/Jul/2018:13:39:41 +0000] "GET /bitstream/handle/10568/75668/dryad.png HTTP/2.0" 200 53750 "http://localhost:4200/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36"
    -
    +
  • -

    2018-07-11

    @@ -417,85 +402,83 @@ org.apache.solr.client.solrj.SolrServerException: IOException occured when talki + +
  • Here are the top ten IPs from last night and this morning:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "11/Jul/2018:22" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -     48 66.249.64.91
    -     50 35.227.26.162
    -     57 157.55.39.234
    -     59 157.55.39.71
    -     62 147.99.27.190
    -     82 95.108.181.88
    -     92 40.77.167.90
    -     97 183.128.40.185
    -     97 240e:f0:44:fa53:745a:8afe:d221:1232
    -   3634 208.110.72.10
    + 48 66.249.64.91
    + 50 35.227.26.162
    + 57 157.55.39.234
    + 59 157.55.39.71
    + 62 147.99.27.190
    + 82 95.108.181.88
    + 92 40.77.167.90
    + 97 183.128.40.185
    + 97 240e:f0:44:fa53:745a:8afe:d221:1232
    +3634 208.110.72.10
     # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "12/Jul/2018:00" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -     25 216.244.66.198
    -     38 40.77.167.185
    -     46 66.249.64.93
    -     56 157.55.39.71
    -     60 35.227.26.162
    -     65 157.55.39.234
    -     83 95.108.181.88
    -     87 66.249.64.91
    -     96 40.77.167.90
    -   7075 208.110.72.10
    -
     + 25 216.244.66.198
     + 38 40.77.167.185
     + 46 66.249.64.93
     + 56 157.55.39.71
     + 60 35.227.26.162
     + 65 157.55.39.234
     + 83 95.108.181.88
     + 87 66.249.64.91
     + 96 40.77.167.90
     +7075 208.110.72.10
     +
  • - +
  • We have never seen 208.110.72.10 before… so that’s interesting!

  • + +
  • The user agent for these requests is: Pcore-HTTP/v0.44.0

  • + +
  • A brief Google search doesn’t turn up any information about what this bot is, but lots of users complaining about it

  • + +
  • This bot does make a lot of requests all through the day, although it seems to re-use its Tomcat session:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "Pcore-HTTP" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -  17098 208.110.72.10
    +17098 208.110.72.10
     # grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=208.110.72.10' dspace.log.2018-07-11
     1161
     # grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=208.110.72.10' dspace.log.2018-07-12
     1885
    -
    +
  • - +
  • I think the problem is that, despite the bot requesting robots.txt, it almost exclusively requests dynamic pages from /discover:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "Pcore-HTTP" | grep -o -E "GET /(browse|discover|search-filter)" | sort -n | uniq -c | sort -rn
    -  13364 GET /discover
    -    993 GET /search-filter
    -    804 GET /browse
    +13364 GET /discover
    +993 GET /search-filter
    +804 GET /browse
     # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "Pcore-HTTP" | grep robots
     208.110.72.10 - - [12/Jul/2018:00:22:28 +0000] "GET /robots.txt HTTP/1.1" 200 1301 "https://cgspace.cgiar.org/robots.txt" "Pcore-HTTP/v0.44.0"
    -
    +
  • - +
  • So this bot is just like Baiduspider, and I need to add it to the nginx rate limiting

  • + +
  • I’ll also add it to Tomcat’s Crawler Session Manager Valve to force the re-use of a common Tomcat session for all crawlers just in case

  • + +
  • Generate a list of all affiliations in CGSpace to send to Mohamed Salem to compare with the list on MEL (sorting the list by most occurrences):

    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where resource_type_id=2 and metadata_field_id=211 group by text_value order by count desc) to /tmp/affiliations.csv with csv header
     COPY 4518
     dspace=# \q
     $ csvcut -c 1 < /tmp/affiliations.csv > /tmp/affiliations-1.csv
    -
    +
  • -

    2018-07-13

    +
  • Generate a list of affiliations for Peter and Abenet to go over so we can batch correct them before we deploy the new data visualization dashboard:

    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv header;
     COPY 4518
    -
    +
  • +

    2018-07-15

    @@ -506,8 +489,8 @@ COPY 4518
  • Peter had asked a question about how mapped items are displayed in the Altmetric dashboard
  • For example, 10568/82810 is mapped to four collections, but only shows up in one “department” in their dashboard
  • Altmetric help said that according to OAI that item is only in one department
  • -
  • I noticed that indeed there was only one collection listed, so I forced an OAI re-import on CGSpace:
  • - + +
  • I noticed that indeed there was only one collection listed, so I forced an OAI re-import on CGSpace:

    $ dspace oai import -c
     OAI 2.0 manager action started
    @@ -522,38 +505,34 @@ Full import
     Total: 73925 items
     Purging cached OAI responses.
     OAI 2.0 manager action ended. It took 697 seconds.
    -
    +
  • - +
  • Now I see four collections in OAI for that item!

  • + +
  • I need to ask the dspace-tech mailing list if the nightly OAI import catches the case of old items that have had metadata or mappings change

  • + +
  • ICARDA sent me a list of the ORCID iDs they have in the MEL system and it looks like almost 150 are new and unique to us!

    $ cat dspace/config/controlled-vocabularies/cg-creator-id.xml | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
     1020
     $ cat dspace/config/controlled-vocabularies/cg-creator-id.xml MEL\ ORCID.json | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
     1158
    -
    +
  • - +
  • I combined the two lists and regenerated the names for all our the ORCID iDs using my resolve-orcids.py script:

    $ cat dspace/config/controlled-vocabularies/cg-creator-id.xml MEL\ ORCID.json | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2018-07-15-orcid-ids.txt
     $ ./resolve-orcids.py -i /tmp/2018-07-15-orcid-ids.txt -o /tmp/2018-07-15-resolved-orcids.txt -d
    -
    +
  • - +
  • Then I added the XML formatting for controlled vocabularies, sorted the list with GNU sort in vim via % !sort and then checked the formatting with tidy:

    $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
    -
    +
  • -

    2018-07-18

    @@ -565,20 +544,20 @@ $ ./resolve-orcids.py -i /tmp/2018-07-15-orcid-ids.txt -o /tmp/2018-07-15-resolv
  • I suggested that we should have a wider meeting about this, and that I would post that on Yammer
  • I was curious about how and when Altmetric harvests the OAI, so I looked in nginx’s OAI log
  • For every day in the past week I only see about 50 to 100 requests per day, but then about nine days ago I see 1500 requests
  • -
  • In there I see two bots making about 750 requests each, and this one is probably Altmetric:
  • - + +
  • In there I see two bots making about 750 requests each, and this one is probably Altmetric:

    178.33.237.157 - - [09/Jul/2018:17:00:46 +0000] "GET /oai/request?verb=ListRecords&resumptionToken=oai_dc////100 HTTP/1.1" 200 58653 "-" "Apache-HttpClient/4.5.2 (Java/1.8.0_121)"
     178.33.237.157 - - [09/Jul/2018:17:01:11 +0000] "GET /oai/request?verb=ListRecords&resumptionToken=oai_dc////200 HTTP/1.1" 200 67950 "-" "Apache-HttpClient/4.5.2 (Java/1.8.0_121)"
     ...
     178.33.237.157 - - [09/Jul/2018:22:10:39 +0000] "GET /oai/request?verb=ListRecords&resumptionToken=oai_dc////73900 HTTP/1.1" 20 0 25049 "-" "Apache-HttpClient/4.5.2 (Java/1.8.0_121)"
    -
    +
  • - +
  • So if they are getting 100 records per OAI request it would take them 739 requests

  • + +
  • I wonder if I should add this user agent to the Tomcat Crawler Session Manager valve… does OAI use Tomcat sessions?

  • + +
  • Appears not:

    $ http --print Hh 'https://cgspace.cgiar.org/oai/request?verb=ListRecords&resumptionToken=oai_dc////100'
     GET /oai/request?verb=ListRecords&resumptionToken=oai_dc////100 HTTP/1.1
    @@ -600,7 +579,8 @@ Vary: Accept-Encoding
     X-Content-Type-Options: nosniff
     X-Frame-Options: SAMEORIGIN
     X-XSS-Protection: 1; mode=block
    -
    +
  • +

    2018-07-19

    @@ -620,44 +600,45 @@ X-XSS-Protection: 1; mode=block + +
  • For future reference, as I had previously noted in 2018-04, sort options are configured in dspace.cfg, for example:

    webui.itemlist.sort-option.3 = dateaccessioned:dc.date.accessioned:date
    -
    +
  • -

    2018-07-23

    + +
  • I looked in the database to see the breakdown of date formats used in dc.date.issued, ie YYYY, YYYY-MM, or YYYY-MM-DD:

    dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=15 and text_value ~ '^[0-9]{4}$';
    - count
    +count
     -------
    - 53292
    +53292
     (1 row)
     dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=15 and text_value ~ '^[0-9]{4}-[0-9]{2}$';
    - count
    +count
     -------
    -  3818
    +3818
     (1 row)
     dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=15 and text_value ~ '^[0-9]{4}-[0-9]{2}-[0-9]{2}$';
    - count
    +count
     -------
    - 17357
    -
     +17357
     +
  • -

    2018-07-26

    diff --git a/docs/2018-08/index.html b/docs/2018-08/index.html index d5f4a8f32..599ad62ef 100644 --- a/docs/2018-08/index.html +++ b/docs/2018-08/index.html @@ -11,18 +11,21 @@ DSpace Test had crashed at some point yesterday morning and I see the following in dmesg: - [Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child [Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB [Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB - Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight + From the DSpace log I see that eventually Solr stopped responding, so I guess the java process that was OOM killed above was Tomcat’s + I’m not sure why Tomcat didn’t crash with an OutOfMemoryError… + Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core + The server only has 8GB of RAM so we’ll eventually need to upgrade to a larger one because we’ll start starving the OS, PostgreSQL, and command line batch processes + I ran all system updates on DSpace Test and rebooted it " /> @@ -37,21 +40,24 @@ I ran all system updates on DSpace Test and rebooted it DSpace Test had crashed at some point yesterday morning and I see the following in dmesg: - [Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child [Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB [Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB - Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight + From the DSpace log I see that eventually Solr stopped responding, so I guess the java process that was OOM killed above was Tomcat’s + I’m not sure why Tomcat didn’t crash with an OutOfMemoryError… + Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core + The server only has 8GB of RAM so we’ll eventually need to upgrade to a larger one because we’ll start starving the OS, PostgreSQL, and command line batch processes + I ran all system updates on DSpace Test and rebooted it "/> - + @@ -133,21 +139,24 @@ I ran all system updates on DSpace Test and rebooted it

    2018-08-01

    +
  • DSpace Test had crashed at some point yesterday morning and I see the following in dmesg:

    [Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
     [Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB
     [Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    -
    +
  • -

    2018-08-16

    +
  • Generate a list of the top 1,500 authors on CGSpace for Sisay so he can create the controlled vocabulary:

    dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc limit 1500) to /tmp/2018-08-16-top-1500-authors.csv with csv; 
    -
    +
  • - +
  • Start working on adding the ORCID metadata to a handful of CIAT authors as requested by Elizabeth earlier this month

  • + +
  • I might need to overhaul the add-orcid-identifiers-csv.py script to be a little more robust about author order and ORCID metadata that might have been altered manually by editors after submission, as this script was written without that consideration

  • + +
  • After checking a few examples I see that checking only the text_value and place when adding ORCID fields is not enough anymore

  • + +
  • It was a sane assumption when I was initially migrating ORCID records from Solr to regular metadata, but now it seems that some authors might have been added or changed after item submission

  • + +
  • Now it is better to check if there is any existing ORCID identifier for a given author for the item…

  • + +
  • I will have to update my script to extract the ORCID identifier and search for that
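
  • In SQL terms the check is roughly this (a sketch; 12345 is a placeholder item id, 0000-0002-1825-0097 a placeholder ORCID iD, and the cg.creator.id field lookup is simplified):

    dspace=# SELECT text_value FROM metadatavalue WHERE resource_type_id=2 AND resource_id=12345 AND metadata_field_id = (SELECT metadata_field_id FROM metadatafieldregistry WHERE element='creator' AND qualifier='id') AND text_value LIKE '%0000-0002-1825-0097%';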

  • + +
  • Re-create my local DSpace database using the latest PostgreSQL 9.6 Docker image and re-import the latest CGSpace dump:

    $ sudo docker run --name dspacedb -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
     $ createuser -h localhost -U postgres --pwprompt dspacetest
    @@ -220,7 +238,8 @@ $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
     $ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest ~/Downloads/cgspace_2018-08-16.backup
     $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
     $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
    -
    +
  • +

    2018-08-19

    @@ -228,8 +247,8 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
  • Keep working on the CIAT ORCID identifiers from Elizabeth
  • In the spreadsheet she sent me there are some names with other versions in the database, so when it is obviously the same one (ie “Schultze-Kraft, Rainer” and “Schultze-Kraft, R.”) I will just tag them with ORCID identifiers too
  • This is less obvious and more error prone with names like “Peters” where there are many more authors
  • -
  • I see some errors in the variations of names as well, for example:
  • - + +
  • I see some errors in the variations of names as well, for example:

    Verchot, Louis
     Verchot, L
    @@ -238,12 +257,11 @@ Verchot, L.V
     Verchot, L.V.
     Verchot, LV
     Verchot, Louis V.
    -
    +
  • - +
  • I’ll just tag them all with Louis Verchot’s ORCID identifier…

  • + +
  • In the end, I’ll run the following CSV with my add-orcid-identifiers-csv.py script:

    dc.contributor.author,cg.creator.id
     "Campbell, Bruce",Bruce M Campbell: 0000-0002-0123-4859
    @@ -273,42 +291,37 @@ Verchot, Louis V.
     "Chirinda, Ngonidzashe",Ngonidzashe Chirinda: 0000-0002-4213-6294
     "Chirinda, Ngoni",Ngonidzashe Chirinda: 0000-0002-4213-6294
     "Ngonidzashe, Chirinda",Ngonidzashe Chirinda: 0000-0002-4213-6294
    -
    +
  • - +
  • The invocation would be:

    $ ./add-orcid-identifiers-csv.py -i 2018-08-16-ciat-orcid.csv -db dspace -u dspace -p 'fuuu'
    -
    +
  • - +
  • I ran the script on DSpace Test and CGSpace and tagged a total of 986 ORCID identifiers

  • + +
  • Looking at the list of author affiliations from Peter one last time

  • + +
  • I notice that I should add the Unicode character 0x00b4 (`) to my list of invalid characters to look for in Open Refine, making the latest version of the GREL expression:

    or(
    -  isNotNull(value.match(/.*\uFFFD.*/)),
    -  isNotNull(value.match(/.*\u00A0.*/)),
    -  isNotNull(value.match(/.*\u200A.*/)),
    -  isNotNull(value.match(/.*\u2019.*/)),
    -  isNotNull(value.match(/.*\u00b4.*/))
    +isNotNull(value.match(/.*\uFFFD.*/)),
    +isNotNull(value.match(/.*\u00A0.*/)),
    +isNotNull(value.match(/.*\u200A.*/)),
    +isNotNull(value.match(/.*\u2019.*/)),
    +isNotNull(value.match(/.*\u00b4.*/))
     )
    -
    +
  • - +
  • This character all by itself is indicative of encoding issues in French, Italian, and Spanish names, for example: De´veloppement and Investigacio´n

  • + +
  • I will run the following on DSpace Test and CGSpace:

    $ ./fix-metadata-values.py -i /tmp/2018-08-15-Correct-1083-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
     $ ./delete-metadata-values.py -i /tmp/2018-08-15-Remove-11-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
    -
    +
  • - +
  • Then force an update of the Discovery index on DSpace Test:

    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m"
     $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
    @@ -316,11 +329,9 @@ $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
     real    72m12.570s
     user    6m45.305s
     sys     2m2.461s
    -
    +
  • - +
  • And then on CGSpace:

    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
     $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
    @@ -328,29 +339,26 @@ $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
     real    79m44.392s
     user    8m50.730s
     sys     2m20.248s
    -
    +
  • - +
  • Run system updates on DSpace Test and reboot the server

  • + +
  • In unrelated news, I see some newish Russian bot making a few thousand requests per day and not re-using its XMLUI session:

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep '19/Aug/2018' | grep -c 5.9.6.51
     1553
     # grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-08-19
     1724
    -
    +
  • - +
  • I don’t even know how it’s possible for the bot to use MORE sessions than total requests…

  • + +
  • The user agent is:

    Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)
    -
    +
  • -

    2018-08-20

    @@ -375,31 +383,37 @@ sys 2m20.248s

    2018-08-21

    +
  • Something must have happened, as the mvn package always takes about two hours now, stopping for a very long time near the end at this step:

    [INFO] Processing overlay [ id org.dspace.modules:xmlui-mirage2]
    -
    +
  • - +
  • It’s the same on DSpace Test, my local laptop, and CGSpace…

  • + +
  • It wasn’t this way before when I was constantly building the previous 5.8 branch with Atmire patches…

  • + +
  • I will restore the previous 5_x-dspace-5.8 and atmire-module-upgrades-5.8 branches to see if the build time is different there

  • + +
  • … it seems that the atmire-module-upgrades-5.8 branch still takes 1 hour and 23 minutes on my local machine…

  • + +
  • Let me try to build the old 5_x-prod-dspace-5.5 branch on my local machine and see how long it takes

  • + +
  • That one only took 13 minutes! So there is definitely something wrong with our 5.8 branch, now I should try vanilla DSpace 5.8

  • + +
  • I notice that the step this pauses at is:

    [INFO] --- maven-war-plugin:2.4:war (default-war) @ xmlui ---
    -
    +
  • -

    2018-08-23

    @@ -410,34 +424,31 @@ sys 2m20.248s
  • I sent a list of the top 1500 author affiliations on CGSpace to CodeObia so we can compare ours with the ones on MELSpace
  • Discuss CTA items with Sisay, he was trying to figure out how to do the collection mapping in combination with SAFBuilder
  • It appears that the web UI’s upload interface requires you to specify the collection, whereas the CLI interface allows you to omit the collection command line flag and defer to the collections file inside each item in the bundle
  • -
  • I imported the CTA items on CGSpace for Sisay:
  • - + +
  • I imported the CTA items on CGSpace for Sisay:

    $ dspace import -a -e s.webshet@cgiar.org -s /home/swebshet/ictupdates_uploads_August_21 -m /tmp/2018-08-23-cta-ictupdates.map
    -
    +
  • +

    2018-08-26

    + +
  • I already finished the Maven build, now I’ll take a backup of the PostgreSQL database and do a database cleanup just in case:

    $ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-08-26-before-dspace-58.backup dspace
     $ dspace cleanup -v
    -
    +
  • - +
  • Now I can stop Tomcat and do the install:

    $ cd dspace/target/dspace-installer
     $ ant update clean_backups update_geolite
    -
    +
  • - +
  • After the successful Ant update I can run the database migrations:

    $ psql dspace dspace
     
    @@ -448,48 +459,55 @@ DELETE 1
     dspace=> \q
     
     $ dspace database migrate ignored
    -
    +
  • - +
  • Then I’ll run all system updates and reboot the server:

    $ sudo su -
     # apt update && apt full-upgrade
     # apt clean && apt autoclean && apt autoremove
     # reboot
    -
    +
  • - +
  • After reboot I logged in and cleared all the XMLUI caches and everything looked to be working fine

  • + +
  • Adam from WLE had asked a few weeks ago about getting the metadata for a bunch of items related to gender from 2013 until now

  • + +
  • They want a CSV with all metadata, which the Atmire Listings and Reports module can’t do

  • + +
  • I exported a list of items from Listings and Reports with the following criteria: from year 2013 until now, have WLE subject GENDER or GENDER POVERTY AND INSTITUTIONS, and CRP Water, Land and Ecosystems

  • + +
  • Then I extracted the Handle links from the report so I could export each item’s metadata as CSV

    $ grep -o -E "[0-9]{5}/[0-9]{0,5}" listings-export.txt > /tmp/iwmi-gender-items.txt
    -
    +
  • - +
  • Then on the DSpace server I exported the metadata for each item one by one:

    $ while read -r line; do dspace metadata-export -f "/tmp/${line/\//-}.csv" -i $line; sleep 2; done < /tmp/iwmi-gender-items.txt
    -
    +
  • -

    2018-08-29

    diff --git a/docs/2018-09/index.html b/docs/2018-09/index.html index 451e691f3..b90167215 100644 --- a/docs/2018-09/index.html +++ b/docs/2018-09/index.html @@ -29,7 +29,7 @@ I’ll update the DSpace role in our Ansible infrastructure playbooks and ru Also, I’ll re-run the postgresql tasks because the custom PostgreSQL variables are dynamic according to the system’s RAM, and we never re-ran them after migrating to larger Linodes last month I’m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I’m getting those autowire errors in Tomcat 8.5.30 again: "/> - + @@ -181,62 +181,59 @@ Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Mana + +
  • For example, given this test.yaml:

    version: 1
     
     requests:
    -  test:
    -    method: GET
    -    url: https://dspacetest.cgiar.org/rest/test
    -    validate:
    -      raw: "REST api is running."
    +test:
    +method: GET
    +url: https://dspacetest.cgiar.org/rest/test
    +validate:
    +  raw: "REST api is running."
     
    -  login:
    -    url: https://dspacetest.cgiar.org/rest/login
    -    method: POST
    -    data:
    -      json: {"email":"test@dspace","password":"thepass"}
    +login:
    +url: https://dspacetest.cgiar.org/rest/login
    +method: POST
    +data:
    +  json: {"email":"test@dspace","password":"thepass"}
     
    -  status:
    -    url: https://dspacetest.cgiar.org/rest/status
    -    method: GET
    -    headers:
    -      rest-dspace-token: Value(login)
    +status:
    +url: https://dspacetest.cgiar.org/rest/status
    +method: GET
    +headers:
    +  rest-dspace-token: Value(login)
     
    -  logout:
    -    url: https://dspacetest.cgiar.org/rest/logout
    -    method: POST
    -    headers:
    -      rest-dspace-token: Value(login)
    +logout:
    +url: https://dspacetest.cgiar.org/rest/logout
    +method: POST
    +headers:
    +  rest-dspace-token: Value(login)
     
     # vim: set sw=2 ts=2:
    -
    +
  • - +
  • Works pretty well, though the DSpace logout always returns an HTTP 415 error for some reason

  • + +
  • We could eventually use this to test sanity of the API for creating collections etc

  • + +
  • A user is getting an error in her workflow:

    2018-09-10 07:26:35,551 ERROR org.dspace.submit.step.CompleteStep @ Caught exception in submission step: 
     org.dspace.authorize.AuthorizeException: Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:2 by user 3819
    -
    +
  • - +
  • Seems to be during submit step, because it’s workflow step 1…?

  • + +
  • Move some top-level CRP communities to be below the new CGIAR Research Programs and Platforms community:

    $ dspace community-filiator --set -p 10568/97114 -c 10568/51670
     $ dspace community-filiator --set -p 10568/97114 -c 10568/35409
     $ dspace community-filiator --set -p 10568/97114 -c 10568/3112
    -
    +
  • - +
  • Valerio contacted me to point out some issues with metadata on CGSpace, which I corrected in PostgreSQL:

    update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='ISI Juornal';
     UPDATE 1
    @@ -248,48 +245,45 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id=226 and
     DELETE 17
     update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='ISI';
     UPDATE 15
    -
    +
  • - +
  • Start working on adding metadata for access and usage rights that we started earlier in 2018 (and also in 2017)

  • + +
  • The current cg.identifier.status field will become “Access rights” and dc.rights will become “Usage rights”

  • + +
  • I have some work in progress on the 5_x-rights branch

  • + +
  • Linode said that CGSpace (linode18) had a high CPU load earlier today

  • + +
  • When I looked, I see it’s the same Russian IP that I noticed last month:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "10/Sep/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -   1459 157.55.39.202
    -   1579 95.108.181.88
    -   1615 157.55.39.147
    -   1714 66.249.64.91
    -   1924 50.116.102.77
    -   3696 157.55.39.106
    -   3763 157.55.39.148
    -   4470 70.32.83.92
    -   4724 35.237.175.180
    -  14132 5.9.6.51
    -
     +1459 157.55.39.202
     +1579 95.108.181.88
     +1615 157.55.39.147
     +1714 66.249.64.91
     +1924 50.116.102.77
     +3696 157.55.39.106
     +3763 157.55.39.148
     +4470 70.32.83.92
     +4724 35.237.175.180
     +14132 5.9.6.51
     +
  • - +
  • And this bot is still creating more Tomcat sessions than Nginx requests (WTF?):

    # grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-09-10 
     14133
    -
    +
  • - +
  • The user agent is still the same:

    Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)
    -
    +
  • - +
  • I added .*crawl.* to the Tomcat Crawler Session Manager Valve, so I’m not sure why the bot is creating so many sessions…

  • + +
  • I just tested that user agent on CGSpace and it does not create a new session:

    $ http --print Hh https://cgspace.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)'
     GET / HTTP/1.1
    @@ -313,29 +307,31 @@ X-Cocoon-Version: 2.2.0
     X-Content-Type-Options: nosniff
     X-Frame-Options: SAMEORIGIN
     X-XSS-Protection: 1; mode=block
    -
    +
  • -

    2018-09-12

    + +
  • Re-create my local Docker container for PostgreSQL data, but using a volume for the database data:

    $ sudo docker volume create --name dspacetest_data
     $ sudo docker run --name dspacedb -v dspacetest_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
    -
    +
  • -

    2018-09-13

    @@ -347,53 +343,60 @@ $ sudo docker run --name dspacedb -v dspacetest_data:/var/lib/postgresql/data -e
  • The dateStamp is most probably only updated when the item’s metadata changes, not its mappings, so if Altmetric is relying on that we’re in a tricky spot
  • We need to make sure that our OAI isn’t publicizing stale data… I was going to post something on the dspace-tech mailing list, but never did
  • Linode says that CGSpace (linode18) has had high CPU for the past two hours
  • -
  • The top IP addresses today are:
  • - + +
  • The top IP addresses today are:

    # zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E "13/Sep/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10                                                                                                
    -     32 46.229.161.131
    -     38 104.198.9.108
    -     39 66.249.64.91
    -     56 157.55.39.224
    -     57 207.46.13.49
    -     58 40.77.167.120
    -     78 169.255.105.46
    -    702 54.214.112.202
    -   1840 50.116.102.77
    -   4469 70.32.83.92
    -
     + 32 46.229.161.131
     + 38 104.198.9.108
     + 39 66.249.64.91
     + 56 157.55.39.224
     + 57 207.46.13.49
     + 58 40.77.167.120
     + 78 169.255.105.46
     +702 54.214.112.202
     +1840 50.116.102.77
     +4469 70.32.83.92
     +
  • - +
  • And the top two addresses seem to be re-using their Tomcat sessions properly:

    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=70.32.83.92' dspace.log.2018-09-13 | sort | uniq
     7
     $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-13 | sort | uniq
     2
    -
    +
  • - +
  • So I’m not sure what’s going on

  • + +
  • Valerio asked me if there’s a way to get the page views and downloads from CGSpace

  • + +
  • I said no, but that we might be able to piggyback on the Atmire statlet REST API

  • + +
  • For example, when you expand the “statlet” at the bottom of an item like 10568/97103 you can see the following request in the browser console:

    https://cgspace.cgiar.org/rest/statlets?handle=10568/97103
    -
    +
  • -

    2018-09-14

    @@ -440,51 +443,46 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
  • I want to explore creating a thin API to make the item view and download stats available from Solr so CodeObia can use them in the AReS explorer
  • Currently CodeObia is exploring using the Atmire statlets internal API, but I don’t really like that…
  • There are some example queries on the DSpace Solr wiki
  • -
  • For example, this query returns 1655 rows for item 10568/10630:
  • - + +
  • For example, this query returns 1655 rows for item 10568/10630:

    $ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false'
    -
    +
  • - +
  • The id in the Solr query is the item’s database id (get it from the REST API or something)

  • + +
  • Next, I adopted a query to get the downloads and it shows 889, which is similar to the number Atmire’s statlet shows, though the query logic here is confusing:

    $ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false&fq=-(bundleName:[*+TO+*]-bundleName:ORIGINAL)&fq=-(statistics_type:[*+TO+*]+-statistics_type:view)'
    -
    +
  • - + +
  • What the shit, I think I’m right: the simplified logic in this query returns the same 889:

    $ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false&fq=bundleName:ORIGINAL&fq=-(statistics_type:[*+TO+*]+-statistics_type:view)'
    -
    +
  • - +
  • And if I simplify the statistics_type logic the same way, it still returns the same 889!

    $ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false&fq=bundleName:ORIGINAL&fq=statistics_type:view'
    -
    +
  • - +
  • As for item views, I suppose that’s just the same query, minus the bundleName:ORIGINAL:

    $ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false&fq=-bundleName:ORIGINAL&fq=statistics_type:view'
    -
    +
  • -

    2018-09-18

    @@ -492,28 +490,27 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09- + +
  • After deploying on DSpace Test I can then get the stats for an item using its ID:

    $ http -b 'https://dspacetest.cgiar.org/rest/statistics/item?id=110988'
     {
    -    "downloads": 2,
    -    "id": 110988,
    -    "views": 15
    +"downloads": 2,
    +"id": 110988,
    +"views": 15
     }
    -
    +
  • - +
  • The numbers are different than those that come from Atmire’s statlets for some reason, but as I’m querying Solr directly, I have no idea where their numbers come from!

  • + +
  • Moayad from CodeObia asked if I could make the API be able to paginate over all items, for example: /statistics?limit=100&page=1

  • + +
  • Getting all the item IDs from PostgreSQL is certainly easy:

    dspace=# select item_id from item where in_archive is True and withdrawn is False and discoverable is True;
    -
    +
  • -

    2018-09-19

    @@ -527,24 +524,24 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09- + +
  • I researched the Solr filterCache size and I found out that the formula for calculating the potential memory use of each entry in the cache is:

    ((maxDoc/8) + 128) * (size_defined_in_solrconfig.xml)
    -
    +
  • - +
  • Which means that, for our statistics core with 149 million documents, each entry in our filterCache would use 8.9 GB!

    ((149374568/8) + 128) * 512 = 9560037888 bytes (8.9 GB)
    -
    +
  • -

    2018-09-21

    @@ -577,8 +574,8 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09- + +
  • It appears SQLite doesn’t support FULL OUTER JOIN so some people on StackOverflow have emulated it with LEFT JOIN and UNION:

    > SELECT views.views, views.id, downloads.downloads, downloads.id FROM itemviews views
     LEFT JOIN itemdownloads downloads USING(id)
    @@ -586,12 +583,11 @@ UNION ALL
     SELECT views.views, views.id, downloads.downloads, downloads.id FROM itemdownloads downloads
     LEFT JOIN itemviews views USING(id)
     WHERE views.id IS NULL;
    -
    +
  • - +
  • This “works” but the resulting rows are kinda messy so I’d have to do extra logic in Python

  • + +
  • Maybe we can use one “items” table with defaults values and UPSERT (aka insert… on conflict … do update):

    sqlite> CREATE TABLE items(id INT PRIMARY KEY, views INT DEFAULT 0, downloads INT DEFAULT 0);
     sqlite> INSERT INTO items(id, views) VALUES(0, 52);
    @@ -600,29 +596,31 @@ sqlite> INSERT INTO items(id, downloads) VALUES(1, 176) ON CONFLICT(id) DO UP
     sqlite> INSERT INTO items(id, views) VALUES(0, 78) ON CONFLICT(id) DO UPDATE SET views=78;
     sqlite> INSERT INTO items(id, views) VALUES(0, 3) ON CONFLICT(id) DO UPDATE SET downloads=3;
     sqlite> INSERT INTO items(id, views) VALUES(0, 7) ON CONFLICT(id) DO UPDATE SET downloads=excluded.views;
    -
    +
  • - +
  • This totally works!

  • + +
  • Note the special excluded.views form! See SQLite’s lang_UPSERT documentation

  • + +
  • Oh nice, I finally finished the Falcon API route to page through all the results using SQLite’s amazing LIMIT and OFFSET support
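
  • The pagination queries themselves are simple (a sketch; the page size of 100 is an assumption):

    sqlite> SELECT * FROM items ORDER BY id LIMIT 100 OFFSET 0;
    sqlite> SELECT * FROM items ORDER BY id LIMIT 100 OFFSET 100;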

  • + +
  • But when I deployed it on my Ubuntu 16.04 environment I realized Ubuntu’s SQLite is old and doesn’t support UPSERT, so my indexing doesn’t work…

  • + +
  • Apparently UPSERT came in SQLite 3.24.0 (2018-06-04), and Ubuntu 16.04 has 3.11.0

  • + +
  • Ok this is hilarious, I manually downloaded the libsqlite3 3.24.0 deb from Ubuntu 18.10 “cosmic” and installed it in Ubuntu 16.04 and now the Python indexer.py works

  • + +
  • This is definitely a dirty hack, but the list of packages we use that depend on libsqlite3-0 in Ubuntu 16.04 are actually pretty few:

    # apt-cache rdepends --installed libsqlite3-0 | sort | uniq
    -  gnupg2
    -  libkrb5-26-heimdal
    -  libnss3
    -  libpython2.7-stdlib
    -  libpython3.5-stdlib
    -
     +gnupg2
     +libkrb5-26-heimdal
     +libnss3
     +libpython2.7-stdlib
     +libpython3.5-stdlib
     +
  • - +
  • I wonder if I could work around this by detecting the SQLite library version, for example on Ubuntu 16.04 after I replaced the library:

    # python3
     Python 3.5.2 (default, Nov 23 2017, 16:37:01) 
    @@ -631,20 +629,21 @@ Type "help", "copyright", "credits" or "licen
     >>> import sqlite3
     >>> print(sqlite3.sqlite_version)
     3.24.0
    -
    +
  • - +
  • Or maybe I should just bite the bullet and migrate this to PostgreSQL, as it supports UPSERT since version 9.5 and also seems to have my new favorite LIMIT and OFFSET
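
  • For the record, the PostgreSQL UPSERT syntax is essentially the same as SQLite’s (a sketch mirroring the earlier example):

    dspacestatistics=> INSERT INTO items(id, views) VALUES(0, 52) ON CONFLICT(id) DO UPDATE SET views=excluded.views;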

  • + +
  • I changed the syntax of the SQLite stuff and PostgreSQL is working flawlessly with psycopg2… hmmm.

  • + +
  • For reference, creating a PostgreSQL database for testing this locally (though indexer.py will create the table):

    $ createdb -h localhost -U postgres -O dspacestatistics --encoding=UNICODE dspacestatistics
     $ createuser -h localhost -U postgres --pwprompt dspacestatistics
     $ psql -h localhost -U postgres dspacestatistics
     dspacestatistics=> CREATE TABLE IF NOT EXISTS items
     dspacestatistics-> (id INT PRIMARY KEY, views INT DEFAULT 0, downloads INT DEFAULT 0)
    -
    +
  • +

    2018-09-25

    @@ -656,55 +655,66 @@ dspacestatistics-> (id INT PRIMARY KEY, views INT DEFAULT 0, downloads INT DE
  • I want to purge the bot hits from the Solr statistics core, as I am now realizing that I don’t give a shit about tens of millions of hits by Google and Bing indexing my shit every day (at least not in Solr!)
  • CGSpace’s Solr core has 150,000,000 documents in it… and it’s still pretty fast to query, but it’s really a maintenance and backup burden
  • DSpace Test currently has about 2,000,000 documents with isBot:true in its Solr statistics core, and the size on disk is 2GB (it’s not much, but I have to test this somewhere!)
  • -
  • According to the DSpace 5.x Solr documentation I can use dspace stats-util -f, so let’s try it:
  • - + +
  • According to the DSpace 5.x Solr documentation I can use dspace stats-util -f, so let’s try it:

    $ dspace stats-util -f
    -
    +
  • - +
  • The command comes back after a few seconds and I still see 2,000,000 documents in the statistics core with isBot:true

  • + +
  • I was just writing a message to the dspace-tech mailing list and then I decided to check the number of bot view events on DSpace Test again, and now it’s 201 instead of 2,000,000, and statistics core is only 30MB now!
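
  • For reference, a quick way to check that count is to query the statistics core directly (a sketch; the Solr port may differ on DSpace Test):

    $ http 'http://localhost:8081/solr/statistics/select?q=isBot:true&rows=0&indent=on'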

  • + +
  • I will set the logBots = false property in dspace/config/modules/usage-statistics.cfg on DSpace Test and check if the number of isBot:true events goes up any more…
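
  • The relevant line in dspace/config/modules/usage-statistics.cfg is just:

    logBots = false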

  • + +
  • I restarted the server with logBots = false and after it came back up I see 266 events with isBots:true (maybe they were buffered)… I will check again tomorrow

  • + +
  • After a few hours I see there are still only 266 view events with isBot:true on DSpace Test’s Solr statistics core, so I’m definitely going to deploy this on CGSpace soon

  • + +
  • Also, CGSpace currently has 60,089,394 view events with isBot:true in its Solr statistics core and it is 124GB!

  • + +
  • Amazing! After running dspace stats-util -f on CGSpace the Solr statistics core went from 124GB to 60GB, and now there are only 700 events with isBot:true so I should really disable logging of bot events!

  • + +
  • I’m super curious to see how the JVM heap usage changes…

  • + +
  • I made (and merged) a pull request to disable bot logging on the 5_x-prod branch (#387)

  • + +
  • Now I’m wondering if there are other bot requests that aren’t classified as bots because the IP lists or user agents are outdated

  • + +
  • DSpace ships a list of spider IPs, for example: config/spiders/iplists.com-google.txt

  • + +
  • I checked the list against all the IPs we’ve seen using the “Googlebot” useragent on CGSpace’s nginx access logs

  • + +
  • The first thing I learned is that shit tons of IPs in Russia, Ukraine, Ireland, Brazil, Portugal, the US, Canada, etc are pretending to be “Googlebot”…

  • + +
  • According to the Googlebot FAQ the domain name in the reverse DNS lookup should contain either googlebot.com or google.com
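
  • The manual check is just a reverse DNS lookup (a sketch; 66.249.64.91 is one of the IPs seen above, and the output is illustrative of Google’s crawl-*.googlebot.com naming):

    $ host 66.249.64.91
    91.64.249.66.in-addr.arpa domain name pointer crawl-66-249-64-91.googlebot.com.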

  • + +
  • In Solr this appears to be an appropriate query that I can maybe use later (returns 81,000 documents):

    *:* AND (dns:*googlebot.com. OR dns:*google.com.) AND isBot:false
    -
    +
  • - +
  • I translate that into a delete command using the /update handler:

    http://localhost:8081/solr/statistics/update?commit=true&stream.body=<delete><query>*:*+AND+(dns:*googlebot.com.+OR+dns:*google.com.)+AND+isBot:false</query></delete>
    -
    +
  • - +
  • And magically all those 81,000 documents are gone!

  • + +
  • After a few hours the Solr statistics core is down to 44GB on CGSpace!

  • + +
  • I did a major refactor and logic fix in the DSpace Statistics API’s indexer.py

  • + +
  • Basically, it turns out that using facet.mincount=1 is really beneficial for me because it reduces the size of the Solr result set, reduces the amount of data we need to ingest into PostgreSQL, and the API returns HTTP 404 Not Found for items without views or downloads anyways
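
  • The relevant change is in the facet parameters of the Solr queries, along these lines (a rough sketch only; the exact fields and filters that indexer.py uses are assumptions here):

    $ http 'http://localhost:8081/solr/statistics/select?q=type:2&fq=isBot:false&rows=0&facet=true&facet.field=id&facet.mincount=1&facet.limit=100&facet.offset=0'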

  • + +
  • I deployed the new version on CGSpace and now it looks pretty good!

    Indexing item views (page 28 of 753)
     ...
     Indexing item downloads (page 260 of 260)
    -
    +
  • -

    2018-09-26

    @@ -720,68 +730,71 @@ Indexing item downloads (page 260 of 260) + +
  • I did a batch replacement of the access rights with my fix-metadata-values.py script on DSpace Test:

    $ ./fix-metadata-values.py -i /tmp/fix-access-status.csv -db dspace -u dspace -p 'fuuu' -f cg.identifier.status -t correct -m 206
    -
    +
  • - +
  • This changes “Open Access” to “Unrestricted Access” and “Limited Access” to “Restricted Access”

  • + +
  • After that I did a full Discovery reindex:

    $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
     real    77m3.755s
     user    7m39.785s
     sys     2m18.485s
    -
    +
  • -

    2018-09-27

    + +
  • Looking in the nginx logs around that time I see some new IPs that look like they are harvesting things:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "26/Sep/2018:(19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -    295 34.218.226.147
    -    296 66.249.64.95
    -    350 157.55.39.185
    -    359 207.46.13.28
    -    371 157.55.39.85
    -    388 40.77.167.148
    -    444 66.249.64.93
    -    544 68.6.87.12
    -    834 66.249.64.91
    -    902 35.237.175.180
    -
     +295 34.218.226.147
     +296 66.249.64.95
     +350 157.55.39.185
     +359 207.46.13.28
     +371 157.55.39.85
     +388 40.77.167.148
     +444 66.249.64.93
     +544 68.6.87.12
     +834 66.249.64.91
     +902 35.237.175.180
     +
  • - +
  • 35.237.175.180 is on Google Cloud

  • + +
  • 68.6.87.12 is on Cox Communications in the US (?)

  • + +
  • These hosts are not using proper user agents and are not re-using their Tomcat sessions:

    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-09-26 | sort | uniq
     5423
     $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=68.6.87.12' dspace.log.2018-09-26 | sort | uniq
     758
    -
    +
  • -

    2018-09-29

    @@ -789,90 +802,80 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=68.6.87.12' dspace.log.2018-09-26 + +
  • I did batch replaces for both on CGSpace with my fix-metadata-values.py script:

    $ ./fix-metadata-values.py -i 2018-09-29-fix-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
     $ ./fix-metadata-values.py -i 2018-09-29-fix-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
    -
    +
  • - +
  • Afterwards I started a full Discovery re-index:

    $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
    -
    +
  • -

    2018-09-30

    + +
  • I think I should just batch export and update all languages…

    dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2018-09-30-languages.csv with csv;
    -
    +
  • - +
  • Then I can simply delete the “Other” and “other” ones because that’s not useful at all:

    dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='Other';
     DELETE 6
     dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='other';
     DELETE 79
    -
    +
  • - +
  • Looking through the list I see some weird language codes like gh, so I checked out those items:

    dspace=# SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='gh';
    - resource_id
    +resource_id
     -------------
    -       94530
    -       94529
    +   94530
    +   94529
     dspace=# SELECT handle,item_id FROM item, handle WHERE handle.resource_type_id=2 AND handle.resource_id = item.item_id AND handle.resource_id in (94530, 94529);
    -   handle    | item_id
    +handle    | item_id
     -------------+---------
    - 10568/91386 |   94529
    - 10568/91387 |   94530
    -
    +10568/91386 | 94529
    +10568/91387 | 94530
    +
  • - +
  • Those items are from Ghana, so the submitter apparently thought gh was a language… I can safely delete them:

    dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='gh';
     DELETE 2
    -
    +
  • - +
  • The next issue would be jn:

    dspace=# SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='jn';
    - resource_id
    +resource_id
     -------------
    -       94001
    -       94003
    +   94001
    +   94003
     dspace=# SELECT handle,item_id FROM item, handle WHERE handle.resource_type_id=2 AND handle.resource_id = item.item_id AND handle.resource_id in (94001, 94003);
    -   handle    | item_id
    +handle    | item_id
     -------------+---------
    - 10568/90868 |   94001
    - 10568/90870 |   94003
    -
    +10568/90868 | 94001
    +10568/90870 | 94003
    +
  • - +
  • Those items are about Japan, so I will update them to be ja

  • + +
  • Other replacements:

    DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='gh';
     UPDATE metadatavalue SET text_value='fr' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='fn';
    @@ -880,10 +883,9 @@ UPDATE metadatavalue SET text_value='hi' WHERE resource_type_id=2 AND metadata_f
     UPDATE metadatavalue SET text_value='ja' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='Ja';
     UPDATE metadatavalue SET text_value='ja' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='jn';
     UPDATE metadatavalue SET text_value='ja' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='jp';
    -
    +
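
  • A sketch of how these one-off UPDATEs could be driven from Python with psycopg2 instead of pasting them into psql (the mapping below just mirrors the statements above; connection details are the usual local ones):

    import psycopg2  # hedged sketch; in practice I ran the SQL directly in psql

    fixes = {'fn': 'fr', 'Ja': 'ja', 'jn': 'ja', 'jp': 'ja'}  # wrong code -> ISO 639-1 code

    conn = psycopg2.connect('dbname=dspace user=dspace password=fuuu host=localhost')
    with conn, conn.cursor() as cursor:
        for old, new in fixes.items():
            cursor.execute("""
                UPDATE metadatavalue SET text_value=%s
                 WHERE resource_type_id=2
                   AND metadata_field_id=(SELECT metadata_field_id FROM metadatafieldregistry
                                           WHERE element='language' AND qualifier='iso')
                   AND text_value=%s
            """, (new, old))
    conn.close()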
  • - diff --git a/docs/2018-10/index.html b/docs/2018-10/index.html index 0a7e6321a..30e09ec31 100644 --- a/docs/2018-10/index.html +++ b/docs/2018-10/index.html @@ -25,7 +25,7 @@ I created a GitHub issue to track this #389, because I’m super busy in Nai Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items I created a GitHub issue to track this #389, because I’m super busy in Nairobi right now "/> - + @@ -114,106 +114,93 @@ I created a GitHub issue to track this #389, because I’m super busy in Nai

    2018-10-03

    +
  • I see Moayad was busy collecting item views and downloads from CGSpace yesterday:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Oct/2018" | awk '{print $1}
     ' | sort | uniq -c | sort -n | tail -n 10
    -    933 40.77.167.90
    -    971 95.108.181.88
    -   1043 41.204.190.40
    -   1454 157.55.39.54
    -   1538 207.46.13.69
    -   1719 66.249.64.61
    -   2048 50.116.102.77
    -   4639 66.249.64.59
    -   4736 35.237.175.180
    - 150362 34.218.226.147
    -
    +933 40.77.167.90
    +971 95.108.181.88
    +1043 41.204.190.40
    +1454 157.55.39.54
    +1538 207.46.13.69
    +1719 66.249.64.61
    +2048 50.116.102.77
    +4639 66.249.64.59
    +4736 35.237.175.180
    +150362 34.218.226.147
    +
  • - +
  • Of those, about 20% were HTTP 500 responses (!):

    $ zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Oct/2018" | grep 34.218.226.147 | awk '{print $9}' | sort -n | uniq -c
    - 118927 200
    -  31435 500
    -
    +118927 200
    +31435 500
    +
  • - +
  • I added Phil Thornton and Sonal Henson’s ORCID identifiers to the controlled vocabulary for cg.creator.orcid and then re-generated the names using my resolve-orcids.py script:

    $ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml | sort | uniq > 2018-10-03-orcids.txt
     $ ./resolve-orcids.py -i 2018-10-03-orcids.txt -o 2018-10-03-names.txt -d
    -
    +
  • - +
  • I found a new corner case error that I need to check (given and family names deactivated):

    Looking up the names associated with ORCID iD: 0000-0001-7930-5752
     Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
    -
    +
  • - +
  • It appears to be Jim Lorenzen… I need to check that later!

  • + +
  • I merged the changes to the 5_x-prod branch (#390)

  • + +
  • Linode sent another alert about CPU usage on CGSpace (linode18) this evening

  • + +
  • It seems that Moayad is making quite a lot of requests today:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Oct/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -   1594 157.55.39.160
    -   1627 157.55.39.173
    -   1774 136.243.6.84
    -   4228 35.237.175.180
    -   4497 70.32.83.92
    -   4856 66.249.64.59
    -   7120 50.116.102.77
    -  12518 138.201.49.199
    -  87646 34.218.226.147
    - 111729 213.139.53.62
    -
    +1594 157.55.39.160
    +1627 157.55.39.173
    +1774 136.243.6.84
    +4228 35.237.175.180
    +4497 70.32.83.92
    +4856 66.249.64.59
    +7120 50.116.102.77
    +12518 138.201.49.199
    +87646 34.218.226.147
    +111729 213.139.53.62
    +
  • - +
  • But in super positive news, he says they are using my new dspace-statistics-api and it’s MUCH faster than using Atmire CUA’s internal “restlet” API

  • + +
  • I don’t recognize the 138.201.49.199 IP, but it is in Germany (Hetzner) and appears to be paginating over some browse pages and downloading bitstreams:

    # grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E 'GET /[a-z]+' | sort | uniq -c
    -   8324 GET /bitstream
    -   4193 GET /handle
    -
    +8324 GET /bitstream
    +4193 GET /handle
    +
  • - +
  • Suspiciously, it’s only grabbing the CGIAR System Office community (handle prefix 10947):

    # grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E 'GET /handle/[0-9]{5}' | sort | uniq -c
    -      7 GET /handle/10568
    -   4186 GET /handle/10947
    -
    + 7 GET /handle/10568
    +4186 GET /handle/10947
    +
  • - +
  • The user agent is suspicious too:

    Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36
    -
    +
  • - +
  • It’s clearly a bot and it’s not re-using its Tomcat session, so I will add its IP to the nginx bad bot list

  • + +
  • I looked in Solr’s statistics core and these hits were actually all counted as isBot:false (of course)… hmmm

  • + +
  • I tagged all of Sonal and Phil’s items with their ORCID identifiers on CGSpace using my add-orcid-identifiers-csv.py script:

    $ ./add-orcid-identifiers-csv.py -i 2018-10-03-add-orcids.csv -db dspace -u dspace -p 'fuuu'
    -
    +
  • - +
  • Where 2018-10-03-add-orcids.csv contained:

    dc.contributor.author,cg.creator.id
     "Henson, Sonal P.",Sonal Henson: 0000-0002-2002-5462
    @@ -224,7 +211,8 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
     "Thornton, Philip K.",Philip Thornton: 0000-0002-1854-0182
     "Thornton, Phillip",Philip Thornton: 0000-0002-1854-0182
     "Thornton, Phillip K.",Philip Thornton: 0000-0002-1854-0182
    -
    +
  • +

    2018-10-04

    @@ -239,16 +227,16 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
  • I see there are other bundles we might need to pay attention to: TEXT, @_LOGO-COLLECTION_@, @_LOGO-COMMUNITY_@, etc…
  • On a hunch I dropped the statistics table and re-indexed and now those two items above have no downloads
  • So it’s fixed, but I’m not sure why!
  • -
  • Peter wants to know the number of API requests per month, which was about 250,000 in September (excluding statlet requests):
  • - + +
  • Peter wants to know the number of API requests per month, which was about 250,000 in September (excluding statlet requests):

    # zcat --force /var/log/nginx/{oai,rest}.log* | grep -E 'Sep/2018' | grep -c -v 'statlets'
     251226
    -
    +
  • -

    2018-10-05

    @@ -278,46 +266,49 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752 + +
  • When I tried to force them to be generated I got an error that I’ve never seen before:

    $ dspace filter-media -v -f -i 10568/97613
     org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: not authorized `/tmp/impdfthumb5039464037201498062.pdf' @ error/constitute.c/ReadImage/412.
    -
    +
  • - +
  • I see there was an update to Ubuntu’s ImageMagick on 2018-10-05, so maybe something changed or broke?

  • -
      <!--<policy domain="coder" rights="none" pattern="PDF" />-->
    -
    +
  • I get the same error when forcing filter-media to run on DSpace Test too, so it’s gotta be an ImageMagick bug

  • -

    2018-10-11

    + +
  • Generate a list of the top 1500 values for dc.subject so Sisay can start making a controlled vocabulary for it:

    dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 57 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2018-10-11-top-1500-subject.csv WITH CSV HEADER;
     COPY 1500
    -
    +
  • - +
  • Give WorldFish advice about Handles because they are talking to some company called KnowledgeArc who recommends they do not use Handles!

  • + +
  • Last week I emailed Altmetric to ask if their software would notice mentions of our Handle in the format “handle:10568/80775” because I noticed that the Land Portal does this

  • + +
  • Altmetric support responded to say no, but the reason is that Land Portal is doing even more strange stuff by not using <meta> tags in their page header, and using “dct:identifier” property instead of “dc:identifier”

  • + +
  • I re-created my local DSpace database container using podman instead of Docker:

    $ mkdir -p ~/.local/lib/containers/volumes/dspacedb_data
     $ sudo podman create --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
    @@ -328,30 +319,29 @@ $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
     $ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost ~/Downloads/cgspace_2018-10-11.backup
     $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
     $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
    -
    +
  • - +
  • I tried to make an Artifactory in podman, but it seems to have problems because Artifactory is distributed on the Bintray repository

  • + +
  • I can pull the docker.bintray.io/jfrog/artifactory-oss:latest image, but not start it

  • + +
  • I decided to use a Sonatype Nexus repository instead:

    $ mkdir -p ~/.local/lib/containers/volumes/nexus_data
     $ sudo podman run --name nexus -d -v /home/aorth/.local/lib/containers/volumes/nexus_data:/nexus_data -p 8081:8081 sonatype/nexus3
    -
    +
  • - +
  • With a few changes to my local Maven settings.xml it is working well

  • + +
  • Generate a list of the top 10,000 authors for Peter Ballantyne to look through:

    dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 3 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 10000) to /tmp/2018-10-11-top-10000-authors.csv WITH CSV HEADER;
     COPY 10000
    -
    +
  • -

    2018-10-13

    @@ -359,27 +349,24 @@ COPY 10000 + +
  • I first facet by blank, trim whitespace, and then check for weird characters that might be indicative of encoding issues with this GREL:

    or(
    -  isNotNull(value.match(/.*\uFFFD.*/)),
    -  isNotNull(value.match(/.*\u00A0.*/)),
    -  isNotNull(value.match(/.*\u200A.*/)),
    -  isNotNull(value.match(/.*\u2019.*/)),
    -  isNotNull(value.match(/.*\u00b4.*/))
    +isNotNull(value.match(/.*\uFFFD.*/)),
    +isNotNull(value.match(/.*\u00A0.*/)),
    +isNotNull(value.match(/.*\u200A.*/)),
    +isNotNull(value.match(/.*\u2019.*/)),
    +isNotNull(value.match(/.*\u00b4.*/))
     )
    -
    +
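
  • The same check is easy to reproduce outside OpenRefine; a small Python sketch that flags values containing any of those characters (just an illustration of the GREL logic above):

    import re

    # replacement character, non-breaking space, hair space, right single quote, acute accent
    suspicious = re.compile('[\uFFFD\u00A0\u200A\u2019\u00b4]')

    values = ['Orth, Alan', 'Orth,\u00a0Alan']   # example metadata values
    for value in values:
        if suspicious.search(value):
            print('suspicious:', repr(value))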
  • - +
  • Then I exported and applied them on my local test server:

    $ ./fix-metadata-values.py -i 2018-10-11-top-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t CORRECT -m 3
    -
    +
  • -

    2018-10-14

    @@ -387,26 +374,32 @@ COPY 10000 + +
  • Apply Peter’s 746 author corrections on CGSpace and DSpace Test using my fix-metadata-values.py script:

    $ ./fix-metadata-values.py -i /tmp/2018-10-11-top-authors.csv -f dc.contributor.author -t CORRECT -m 3 -db dspace -u dspace -p 'fuuu'
    -
    +
  • -

    2018-10-15

    @@ -423,8 +416,8 @@ COPY 10000
  • He said he actually wants to test creation of communities, collections, etc, so I had to make him a super admin for now
  • I told him we need to think about the workflow more seriously in the future
  • -
  • I ended up having some issues with podman and went back to Docker, so I had to re-create my containers:
  • - + +
  • I ended up having some issues with podman and went back to Docker, so I had to re-create my containers:

    $ sudo docker run --name nexus --network dspace-build -d -v /home/aorth/.local/lib/containers/volumes/nexus_data:/nexus_data -p 8081:8081 sonatype/nexus3
     $ sudo docker run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
    @@ -434,21 +427,20 @@ $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
     $ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost ~/Downloads/cgspace_2018-10-11.backup
     $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
     $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
    -
    +
  • +

    2018-10-16

    +
  • Generate a list of the schema on CGSpace so CodeObia can compare with MELSpace:

    dspace=# \copy (SELECT (CASE when metadata_schema_id=1 THEN 'dc' WHEN metadata_schema_id=2 THEN 'cg' END) AS schema, element, qualifier, scope_note FROM metadatafieldregistry where metadata_schema_id IN (1,2)) TO /tmp/cgspace-schema.csv WITH CSV HEADER;
    -
    +
  • - +
  • Talking to the CodeObia guys about the REST API I started to wonder why it’s so slow and how I can quantify it in order to ask the dspace-tech mailing list for help profiling it

  • + +
  • Interestingly, the speed doesn’t get better after you request the same thing multiple times–it’s consistently bad on both CGSpace and DSpace Test!

    $ time http --print h 'https://cgspace.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0'
     ...
    @@ -465,13 +457,13 @@ $ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,b
     0.23s user 0.04s system 1% cpu 16.460 total
     0.24s user 0.04s system 1% cpu 21.043 total
     0.22s user 0.04s system 1% cpu 17.132 total
    -
    +
  • - +
  • I should note that at this time CGSpace is using Oracle Java and DSpace Test is using OpenJDK (both version 8)

  • + +
  • I wonder if the Java garbage collector is important here, or if there are missing indexes in PostgreSQL?

  • + +
  • I switched DSpace Test to the G1GC garbage collector and tried again and now the results are worse!

    $ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0'
     ...
    @@ -480,11 +472,9 @@ $ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,b
     0.24s user 0.02s system 1% cpu 22.496 total
     0.22s user 0.03s system 1% cpu 22.720 total
     0.23s user 0.03s system 1% cpu 22.632 total
    -
    +
  • - +
  • If I make a request without the expands it is ten times faster:

    $ time http --print h 'https://dspacetest.cgiar.org/rest/items?limit=100&offset=0'
     ...
    @@ -492,10 +482,9 @@ $ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,b
     0.22s user 0.03s system 8% cpu 2.896 total
     0.21s user 0.05s system 9% cpu 2.787 total
     0.23s user 0.02s system 8% cpu 2.896 total
    -
    +
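
  • To make that comparison repeatable I could script it; a simple Python timing sketch (URLs as above, everything else illustrative):

    import time
    import requests  # sketch; the numbers above came from httpie and the shell's time

    base = 'https://dspacetest.cgiar.org/rest/items?limit=100&offset=0'
    expanded = base + '&expand=metadata,bitstreams,parentCommunityList'

    for label, url in [('plain', base), ('expanded', expanded)]:
        started = time.time()
        requests.get(url).raise_for_status()
        print('{}: {:.1f}s'.format(label, time.time() - started))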
  • -

    2018-10-17

    @@ -503,8 +492,8 @@ $ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,b + +
  • I manually went through and looked at the existing values and updated them in several batches:

    UPDATE metadatavalue SET text_value='CC-BY-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%CC BY %';
     UPDATE metadatavalue SET text_value='CC-BY-NC-ND-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%4.0%' AND text_value LIKE '%BY-NC-ND%' AND text_value LIKE '%by-nc-nd%';
    @@ -522,34 +511,35 @@ UPDATE metadatavalue SET text_value='CC-BY-3.0' WHERE resource_type_id=2 AND met
     UPDATE metadatavalue SET text_value='CC-BY-ND-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND resource_id=78184;
     UPDATE metadatavalue SET text_value='CC-BY' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value NOT LIKE '%zero%' AND text_value NOT LIKE '%CC0%' AND text_value LIKE '%Attribution %' AND text_value NOT LIKE '%CC-%';
     UPDATE metadatavalue SET text_value='CC-BY-NC-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND resource_id=78564;
    -
    +
  • - +
  • I updated the fields on CGSpace and then started a re-index of Discovery

  • + +
  • We also need to re-think the dc.rights field in the submission form: we should probably use a popup controlled vocabulary and list the Creative Commons values with version numbers and allow the user to enter their own (like the ORCID identifier field)

  • + +
  • Ask Jane if we can use some of the BDP money to host AReS explorer on a more powerful server

  • + +
  • IWMI sent me a list of new ORCID identifiers for their staff so I combined them with our list, updated the names with my resolve-orcids.py script, and regenerated the controlled vocabulary:

    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml MEL\ ORCID.json MEL\ ORCID_V2.json 2018-10-17-IWMI-ORCIDs.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq >
     2018-10-17-orcids.txt
     $ ./resolve-orcids.py -i 2018-10-17-orcids.txt -o 2018-10-17-names.txt -d
     $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
    -
    +
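
  • That grep/sort/uniq pipeline is easy to mirror in Python if I ever fold it into resolve-orcids.py itself (a sketch; the file paths are the same ones used above):

    import re

    pattern = re.compile(r'[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}')
    with open('dspace/config/controlled-vocabularies/cg-creator-id.xml') as vocabulary:
        orcids = sorted(set(pattern.findall(vocabulary.read())))
    with open('2018-10-17-orcids.txt', 'w') as outfile:
        outfile.write('\n'.join(orcids) + '\n')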
  • - +
  • I also decided to add the ORCID identifiers that MEL had sent us a few months ago…

  • + +
  • One problem I had with the resolve-orcids.py script is that one user seems to have disabled their profile data since we last updated:

    Looking up the names associated with ORCID iD: 0000-0001-7930-5752
     Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
    -
    +
  • -

    2018-10-18

    @@ -557,79 +547,78 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752 + +
  • I upgraded PostgreSQL to 9.6 on DSpace Test using Ansible, then had to manually migrate from 9.5 to 9.6:

    # su - postgres
     $ /usr/lib/postgresql/9.6/bin/pg_upgrade -b /usr/lib/postgresql/9.5/bin -B /usr/lib/postgresql/9.6/bin -d /var/lib/postgresql/9.5/main -D /var/lib/postgresql/9.6/main -o ' -c config_file=/etc/postgresql/9.5/main/postgresql.conf' -O ' -c config_file=/etc/postgresql/9.6/main/postgresql.conf'
     $ exit
     # systemctl start postgresql
     # dpkg -r postgresql-9.5 postgresql-client-9.5 postgresql-contrib-9.5
    -
    +
  • +

    2018-10-19

    + +
  • Looking at the nginx logs around that time I see the following IPs making the most requests:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "19/Oct/2018:(12|13|14|15)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -    361 207.46.13.179
    -    395 181.115.248.74
    -    485 66.249.64.93
    -    535 157.55.39.213
    -    536 157.55.39.99
    -    551 34.218.226.147
    -    580 157.55.39.173
    -   1516 35.237.175.180
    -   1629 66.249.64.91
    -   1758 5.9.6.51
    -
    +361 207.46.13.179
    +395 181.115.248.74
    +485 66.249.64.93
    +535 157.55.39.213
    +536 157.55.39.99
    +551 34.218.226.147
    +580 157.55.39.173
    +1516 35.237.175.180
    +1629 66.249.64.91
    +1758 5.9.6.51
    +
  • -

    2018-10-20

    + +
  • This means our existing Solr configuration doesn’t run in Solr 5.5:

    $ sudo docker pull solr:5
     $ sudo docker run --name my_solr -v ~/dspace/solr/statistics/conf:/tmp/conf -d -p 8983:8983 -t solr:5
     $ sudo docker logs my_solr
     ...
     ERROR: Error CREATEing SolrCore 'statistics': Unable to create core [statistics] Caused by: solr.IntField
    -
    +
  • - +
  • Apparently a bunch of old field type classes like solr.IntField were removed in Solr 5

  • + +
  • So for now it’s actually a huge pain in the ass to run the tests for my dspace-statistics-api

  • + +
  • Linode sent a message that the CPU usage was high on CGSpace (linode18) last night

  • + +
  • According to the nginx logs around that time it was 5.9.6.51 (MegaIndex) again:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "20/Oct/2018:(14|15|16)" | awk '{print $1}' | sort
    - | uniq -c | sort -n | tail -n 10
    -    249 207.46.13.179
    -    250 157.55.39.173
    -    301 54.166.207.223
    -    303 157.55.39.213
    -    310 66.249.64.95
    -    362 34.218.226.147
    -    381 66.249.64.93
    -    415 35.237.175.180
    -   1205 66.249.64.91
    -   1227 5.9.6.51
    -
    +| uniq -c | sort -n | tail -n 10
    +249 207.46.13.179
    +250 157.55.39.173
    +301 54.166.207.223
    +303 157.55.39.213
    +310 66.249.64.95
    +362 34.218.226.147
    +381 66.249.64.93
    +415 35.237.175.180
    +1205 66.249.64.91
    +1227 5.9.6.51
    +
  • - +
  • This bot is only using the XMLUI and it does not seem to be re-using its sessions:

    # grep -c 5.9.6.51 /var/log/nginx/*.log
     /var/log/nginx/access.log:9323
    @@ -640,17 +629,14 @@ ERROR: Error CREATEing SolrCore 'statistics': Unable to create core [statistics]
     /var/log/nginx/statistics.log:0
     # grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-10-20 | sort | uniq
     8915
    -
    +
  • - +
  • Last month I added “crawl” to the Tomcat Crawler Session Manager Valve’s regular expression matching, and it seems to be working for MegaIndex’s user agent:

    $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1' User-Agent:'"Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)"'
    -
    +
  • -

    2018-10-21

    @@ -664,27 +650,29 @@ ERROR: Error CREATEing SolrCore 'statistics': Unable to create core [statistics] + +
  • We will still need to do a batch update of the dc.identifier.uri and other fields in the database:

    dspace=# UPDATE metadatavalue SET text_value=replace(text_value, 'http://', 'https://') WHERE resource_type_id=2 AND text_value LIKE 'http://hdl.handle.net%';
    -
    +
  • - +
  • While I was doing that I found two items using CGSpace URLs instead of handles in their dc.identifier.uri so I corrected those

  • + +
  • I also found several items that had invalid characters or multiple Handles in some related URL field like cg.link.reference so I corrected those too

  • + +
  • Improve the usage rights on the submission form by adding a default selection with no value as well as a better hint to look for the CC license on the publisher page or in the PDF (#398)

  • + +
  • I deployed the changes on CGSpace, ran all system updates, and rebooted the server

  • + +
  • Also, I updated all Handles in the database to use HTTPS:

    dspace=# UPDATE metadatavalue SET text_value=replace(text_value, 'http://', 'https://') WHERE resource_type_id=2 AND text_value LIKE 'http://hdl.handle.net%';
     UPDATE 76608
    -
    +
  • -

    2018-10-23

    @@ -693,18 +681,16 @@ UPDATE 76608
  • Improve the usage rights (dc.rights) on CGSpace again by adding the long names in the submission form, as well as adding version 3.0 and the Creative Commons Zero (CC0) public domain license (#399)
  • Add “usage rights” to the XMLUI item display (#400)
  • I emailed the MARLO guys to ask if they can send us a dump of rights data and Handles from their system so we can tag our older items on CGSpace
  • -
  • Testing REST login and logout via httpie because Felix from Earlham says he’s having issues:
  • - + +
  • Testing REST login and logout via httpie because Felix from Earlham says he’s having issues:

    $ http --print b POST 'https://dspacetest.cgiar.org/rest/login' email='testdeposit@cgiar.org' password=deposit
     acef8a4a-41f3-4392-b870-e873790f696b
     
     $ http POST 'https://dspacetest.cgiar.org/rest/logout' rest-dspace-token:acef8a4a-41f3-4392-b870-e873790f696b
    -
    +
  • - +
  • Also works via curl (login, check status, logout, check status):

    $ curl -H "Content-Type: application/json" --data '{"email":"testdeposit@cgiar.org", "password":"deposit"}' https://dspacetest.cgiar.org/rest/login
     e09fb5e1-72b0-4811-a2e5-5c1cd78293cc
    @@ -713,11 +699,11 @@ $ curl -X GET -H "Content-Type: application/json" -H "Accept: app
     $ curl -X POST -H "Content-Type: application/json" -H "rest-dspace-token: e09fb5e1-72b0-4811-a2e5-5c1cd78293cc" https://dspacetest.cgiar.org/rest/logout
     $ curl -X GET -H "Content-Type: application/json" -H "Accept: application/json" -H "rest-dspace-token: e09fb5e1-72b0-4811-a2e5-5c1cd78293cc" https://dspacetest.cgiar.org/rest/status
     {"okay":true,"authenticated":false,"email":null,"fullname":null,"token":null}%
    -
    +
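
  • The same flow also works from Python with requests (a sketch using the same test account as above):

    import requests  # sketch; I actually tested with httpie and curl as shown above

    base = 'https://dspacetest.cgiar.org/rest'
    token = requests.post(base + '/login', json={'email': 'testdeposit@cgiar.org',
                                                 'password': 'deposit'}).text.strip()
    headers = {'rest-dspace-token': token, 'Accept': 'application/json'}
    print(requests.get(base + '/status', headers=headers).json())  # authenticated: true
    requests.post(base + '/logout', headers=headers)
    print(requests.get(base + '/status', headers=headers).json())  # authenticated: false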
  • -

    2018-10-24

    diff --git a/docs/2018-11/index.html b/docs/2018-11/index.html index 453dd705a..bdcc15e77 100644 --- a/docs/2018-11/index.html +++ b/docs/2018-11/index.html @@ -39,7 +39,7 @@ Send a note about my dspace-statistics-api to the dspace-tech mailing list Linode has been sending mails a few times a day recently that CGSpace (linode18) has had high CPU usage Today these are the top 10 IPs: "/> - + @@ -148,109 +148,105 @@ Today these are the top 10 IPs: + +
  • 84.38.130.177 is some new IP in Latvia that is only hitting the XMLUI, using the following user agent:

    Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.792.0 Safari/535.1
    -
    +
  • - +
  • They at least seem to be re-using their Tomcat sessions:

    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=84.38.130.177' dspace.log.2018-11-03
     342
    -
    +
  • - +
  • 50.116.102.77 is also a regular REST API user

  • + +
  • 40.77.167.175 and 207.46.13.156 seem to be Bing

  • + +
  • 138.201.52.218 seems to be on Hetzner in Germany, but is using this user agent:

    Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
    -
    +
  • - +
  • And it doesn’t seem they are re-using their Tomcat sessions:

    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=138.201.52.218' dspace.log.2018-11-03
     1243
    -
    +
  • - +
  • Ah, we’ve apparently seen this server exactly a year ago in 2017-11, making 40,000 requests in one day…

  • + +
  • I wonder if it’s worth adding them to the list of bots in the nginx config?

  • + +
  • Linode sent a mail that CGSpace (linode18) is using high outgoing bandwidth

  • + +
  • Looking at the nginx logs again I see the following top ten IPs:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Nov/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -   1979 50.116.102.77
    -   1980 35.237.175.180
    -   2186 207.46.13.156
    -   2208 40.77.167.175
    -   2843 66.249.64.63
    -   4220 84.38.130.177
    -   4537 70.32.83.92
    -   5593 66.249.64.61
    -  12557 78.46.89.18
    -  32152 66.249.64.59
    -
    +1979 50.116.102.77
    +1980 35.237.175.180
    +2186 207.46.13.156
    +2208 40.77.167.175
    +2843 66.249.64.63
    +4220 84.38.130.177
    +4537 70.32.83.92
    +5593 66.249.64.61
    +12557 78.46.89.18
    +32152 66.249.64.59
    +
  • - +
  • 78.46.89.18 is new since I last checked a few hours ago, and it’s from Hetzner with the following user agent:

    Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
    -
    +
  • - +
  • It’s making lots of requests, though actually it does seem to be re-using its Tomcat sessions:

    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03
     8449
     $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03 | sort | uniq | wc -l
     1
    -
    +
  • - +
  • Updated on 2018-12-04 to correct the grep command above, as it was inaccurate and it seems the bot was actually already re-using its Tomcat sessions

  • + +
  • I could add this IP to the list of bot IPs in nginx, but it seems like a futile effort when some new IP could come along and do the same thing

  • + +
  • Perhaps I should think about adding rate limits to dynamic pages like /discover and /browse

  • + +
  • I think it’s reasonable for a human to click one of those links five or ten times a minute…

  • + +
  • To contrast, 78.46.89.18 made about 300 requests per minute for a few hours today:

    # grep 78.46.89.18 /var/log/nginx/access.log | grep -o -E '03/Nov/2018:[0-9][0-9]:[0-9][0-9]' | sort | uniq -c | sort -n | tail -n 20
    -    286 03/Nov/2018:18:02
    -    287 03/Nov/2018:18:21
    -    289 03/Nov/2018:18:23
    -    291 03/Nov/2018:18:27
    -    293 03/Nov/2018:18:34
    -    300 03/Nov/2018:17:58
    -    300 03/Nov/2018:18:22
    -    300 03/Nov/2018:18:32
    -    304 03/Nov/2018:18:12
    -    305 03/Nov/2018:18:13
    -    305 03/Nov/2018:18:24
    -    312 03/Nov/2018:18:39
    -    322 03/Nov/2018:18:17
    -    326 03/Nov/2018:18:38
    -    327 03/Nov/2018:18:16
    -    330 03/Nov/2018:17:57
    -    332 03/Nov/2018:18:19
    -    336 03/Nov/2018:17:56
    -    340 03/Nov/2018:18:14
    -    341 03/Nov/2018:18:18
    -
    +286 03/Nov/2018:18:02
    +287 03/Nov/2018:18:21
    +289 03/Nov/2018:18:23
    +291 03/Nov/2018:18:27
    +293 03/Nov/2018:18:34
    +300 03/Nov/2018:17:58
    +300 03/Nov/2018:18:22
    +300 03/Nov/2018:18:32
    +304 03/Nov/2018:18:12
    +305 03/Nov/2018:18:13
    +305 03/Nov/2018:18:24
    +312 03/Nov/2018:18:39
    +322 03/Nov/2018:18:17
    +326 03/Nov/2018:18:38
    +327 03/Nov/2018:18:16
    +330 03/Nov/2018:17:57
    +332 03/Nov/2018:18:19
    +336 03/Nov/2018:17:56
    +340 03/Nov/2018:18:14
    +341 03/Nov/2018:18:18
    +
  • -
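
  • A Python version of that per-minute count, in case I want to reuse it for other IPs (log path and IP are examples):

    import re
    from collections import Counter

    ip = '78.46.89.18'
    minute = re.compile(r'\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2})')   # e.g. 03/Nov/2018:18:02

    per_minute = Counter()
    with open('/var/log/nginx/access.log') as log:
        for line in log:
            if line.startswith(ip + ' '):
                match = minute.search(line)
                if match:
                    per_minute[match.group(1)] += 1

    for stamp, count in per_minute.most_common(20):
        print(count, stamp)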

    2018-11-04

    @@ -258,137 +254,127 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03 + +
  • Here are the top ten IPs active so far this morning:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "04/Nov/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -   1083 2a03:2880:11ff:2::face:b00c
    -   1105 2a03:2880:11ff:d::face:b00c
    -   1111 2a03:2880:11ff:f::face:b00c
    -   1134 84.38.130.177
    -   1893 50.116.102.77
    -   2040 66.249.64.63
    -   4210 66.249.64.61
    -   4534 70.32.83.92
    -  13036 78.46.89.18
    -  20407 66.249.64.59
    -
    +1083 2a03:2880:11ff:2::face:b00c
    +1105 2a03:2880:11ff:d::face:b00c
    +1111 2a03:2880:11ff:f::face:b00c
    +1134 84.38.130.177
    +1893 50.116.102.77
    +2040 66.249.64.63
    +4210 66.249.64.61
    +4534 70.32.83.92
    +13036 78.46.89.18
    +20407 66.249.64.59
    +
  • - +
  • 78.46.89.18 is back… and it is still actually re-using its Tomcat sessions:

    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04
     8765
     $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04 | sort | uniq | wc -l
     1
    -
    +
  • - +
  • Updated on 2018-12-04 to correct the grep command and point out that the bot was actually re-using its Tomcat sessions properly

  • + +
  • Also, now we have a ton of Facebook crawlers:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "04/Nov/2018" | grep "2a03:2880:11ff:" | awk '{print $1}' | sort | uniq -c | sort -n
    -    905 2a03:2880:11ff:b::face:b00c
    -    955 2a03:2880:11ff:5::face:b00c
    -    965 2a03:2880:11ff:e::face:b00c
    -    984 2a03:2880:11ff:8::face:b00c
    -    993 2a03:2880:11ff:3::face:b00c
    -    994 2a03:2880:11ff:7::face:b00c
    -   1006 2a03:2880:11ff:10::face:b00c
    -   1011 2a03:2880:11ff:4::face:b00c
    -   1023 2a03:2880:11ff:6::face:b00c
    -   1026 2a03:2880:11ff:9::face:b00c
    -   1039 2a03:2880:11ff:1::face:b00c
    -   1043 2a03:2880:11ff:c::face:b00c
    -   1070 2a03:2880:11ff::face:b00c
    -   1075 2a03:2880:11ff:a::face:b00c
    -   1093 2a03:2880:11ff:2::face:b00c
    -   1107 2a03:2880:11ff:d::face:b00c
    -   1116 2a03:2880:11ff:f::face:b00c
    -
    +905 2a03:2880:11ff:b::face:b00c
    +955 2a03:2880:11ff:5::face:b00c
    +965 2a03:2880:11ff:e::face:b00c
    +984 2a03:2880:11ff:8::face:b00c
    +993 2a03:2880:11ff:3::face:b00c
    +994 2a03:2880:11ff:7::face:b00c
    +1006 2a03:2880:11ff:10::face:b00c
    +1011 2a03:2880:11ff:4::face:b00c
    +1023 2a03:2880:11ff:6::face:b00c
    +1026 2a03:2880:11ff:9::face:b00c
    +1039 2a03:2880:11ff:1::face:b00c
    +1043 2a03:2880:11ff:c::face:b00c
    +1070 2a03:2880:11ff::face:b00c
    +1075 2a03:2880:11ff:a::face:b00c
    +1093 2a03:2880:11ff:2::face:b00c
    +1107 2a03:2880:11ff:d::face:b00c
    +1116 2a03:2880:11ff:f::face:b00c
    +
  • - +
  • They are really making shit tons of requests:

    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04
     37721
    -
    +
  • - +
  • Updated on 2018-12-04 to correct the grep command to accurately show the number of requests

  • + +
  • Their user agent is:

    facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
    -
    +
  • - +
  • I will add it to the Tomcat Crawler Session Manager valve

  • + +
  • Later in the evening… ok, this Facebook bot is getting super annoying:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "04/Nov/2018" | grep "2a03:2880:11ff:" | awk '{print $1}' | sort | uniq -c | sort -n
    -   1871 2a03:2880:11ff:3::face:b00c
    -   1885 2a03:2880:11ff:b::face:b00c
    -   1941 2a03:2880:11ff:8::face:b00c
    -   1942 2a03:2880:11ff:e::face:b00c
    -   1987 2a03:2880:11ff:1::face:b00c
    -   2023 2a03:2880:11ff:2::face:b00c
    -   2027 2a03:2880:11ff:4::face:b00c
    -   2032 2a03:2880:11ff:9::face:b00c
    -   2034 2a03:2880:11ff:10::face:b00c
    -   2050 2a03:2880:11ff:5::face:b00c
    -   2061 2a03:2880:11ff:c::face:b00c
    -   2076 2a03:2880:11ff:6::face:b00c
    -   2093 2a03:2880:11ff:7::face:b00c
    -   2107 2a03:2880:11ff::face:b00c
    -   2118 2a03:2880:11ff:d::face:b00c
    -   2164 2a03:2880:11ff:a::face:b00c
    -   2178 2a03:2880:11ff:f::face:b00c
    -
    +1871 2a03:2880:11ff:3::face:b00c
    +1885 2a03:2880:11ff:b::face:b00c
    +1941 2a03:2880:11ff:8::face:b00c
    +1942 2a03:2880:11ff:e::face:b00c
    +1987 2a03:2880:11ff:1::face:b00c
    +2023 2a03:2880:11ff:2::face:b00c
    +2027 2a03:2880:11ff:4::face:b00c
    +2032 2a03:2880:11ff:9::face:b00c
    +2034 2a03:2880:11ff:10::face:b00c
    +2050 2a03:2880:11ff:5::face:b00c
    +2061 2a03:2880:11ff:c::face:b00c
    +2076 2a03:2880:11ff:6::face:b00c
    +2093 2a03:2880:11ff:7::face:b00c
    +2107 2a03:2880:11ff::face:b00c
    +2118 2a03:2880:11ff:d::face:b00c
    +2164 2a03:2880:11ff:a::face:b00c
    +2178 2a03:2880:11ff:f::face:b00c
    +
  • - +
  • Now at least the Tomcat Crawler Session Manager Valve seems to be forcing it to re-use some Tomcat sessions:

    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04
     37721
     $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04 | sort | uniq | wc -l
     15206
    -
    +
  • - +
  • I think we still need to limit more of the dynamic pages, like the “most popular” country, item, and author pages

  • + +
  • It seems these are popular too, and there is no fucking way Facebook needs that information, yet they are requesting thousands of them!

    # grep 'face:b00c' /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -c 'most-popular/'
     7033
    -
    +
  • -

    2018-11-05

    +
  • I wrote a small Python script add-dc-rights.py to add usage rights (dc.rights) to CGSpace items based on the CSV Hector gave me from MARLO:

    $ ./add-dc-rights.py -i /tmp/marlo.csv -db dspace -u dspace -p 'fuuu'
    -
    +
  • - +
  • The file marlo.csv was cleaned up and formatted in OpenRefine

  • + +
  • 165 of the items in their 2017 data are from CGSpace!

  • + +
  • I will add the data to CGSpace this week (done!)

  • + +
  • Jesus, is Facebook trying to be annoying? At least the Tomcat Crawler Session Manager Valve is working to force the bot to re-use its Tomcat sessions:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "05/Nov/2018" | grep -c "2a03:2880:11ff:"
     29889
    @@ -398,11 +384,11 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11
     1057
     # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "05/Nov/2018" | grep "2a03:2880:11ff:" | grep -c -E "(handle|bitstream)"
     29896
    -
    +
  • -

    2018-11-06

    @@ -410,14 +396,13 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11 + +
  • I realized I actually only need expand=collections,subCommunities, and I wanted to see how much overhead the extra expands created so I did three runs of each:

    $ time ./rest-find-collections.py 10568/27629 --rest-url https://dspacetest.cgiar.org/rest
    -
    +
  • -

    2018-11-07

    @@ -482,55 +467,51 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11

    2018-11-19

    +
  • Testing corrections and deletions for AGROVOC (dc.subject) that Sisay and Peter were working on earlier this month:

    $ ./fix-metadata-values.py -i 2018-11-19-correct-agrovoc.csv -f dc.subject -t correct -m 57 -db dspace -u dspace -p 'fuu' -d
     $ ./delete-metadata-values.py -i 2018-11-19-delete-agrovoc.csv -f dc.subject -m 57 -db dspace -u dspace -p 'fuu' -d
    -
    +
  • - +
  • Then I ran them on both CGSpace and DSpace Test, and started a full Discovery re-index on CGSpace:

    $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
    -
    +
  • - +
  • Generate a new list of the top 1500 AGROVOC subjects on CGSpace to send to Peter and Sisay:

    dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 57 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2018-11-19-top-1500-subject.csv WITH CSV HEADER;
    -
    +
  • +

    2018-11-20

    + +
  • The dspace.log.2018-11-19 shows this at the time:

    2018-11-19 15:23:04,221 ERROR com.atmire.dspace.discovery.AtmireSolrService @ DSpace kernel cannot be null
     java.lang.IllegalStateException: DSpace kernel cannot be null
    -        at org.dspace.utils.DSpace.getServiceManager(DSpace.java:63)
    -        at org.dspace.utils.DSpace.getSingletonService(DSpace.java:87)
    -        at com.atmire.dspace.discovery.AtmireSolrService.buildDocument(AtmireSolrService.java:102)
    -        at com.atmire.dspace.discovery.AtmireSolrService.indexContent(AtmireSolrService.java:815)
    -        at com.atmire.dspace.discovery.AtmireSolrService.updateIndex(AtmireSolrService.java:884)
    -        at org.dspace.discovery.SolrServiceImpl.createIndex(SolrServiceImpl.java:370)
    -        at org.dspace.discovery.IndexClient.main(IndexClient.java:117)
    -        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    -        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    -        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    -        at java.lang.reflect.Method.invoke(Method.java:498)
    -        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
    -        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
    +    at org.dspace.utils.DSpace.getServiceManager(DSpace.java:63)
    +    at org.dspace.utils.DSpace.getSingletonService(DSpace.java:87)
    +    at com.atmire.dspace.discovery.AtmireSolrService.buildDocument(AtmireSolrService.java:102)
    +    at com.atmire.dspace.discovery.AtmireSolrService.indexContent(AtmireSolrService.java:815)
    +    at com.atmire.dspace.discovery.AtmireSolrService.updateIndex(AtmireSolrService.java:884)
    +    at org.dspace.discovery.SolrServiceImpl.createIndex(SolrServiceImpl.java:370)
    +    at org.dspace.discovery.IndexClient.main(IndexClient.java:117)
    +    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    +    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    +    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    +    at java.lang.reflect.Method.invoke(Method.java:498)
    +    at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
    +    at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
     2018-11-19 15:23:04,223 INFO  com.atmire.dspace.discovery.AtmireSolrService @ Processing (4629 of 76007): 72731
    -
    +
  • -

    2018-12-03

    + +
  • Even manually on my Arch Linux desktop with ghostscript 9.26-1 and the pngalpha device, I can generate a thumbnail for the first one (10568/98394):

    $ gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 -sDEVICE=pngalpha -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r72x72 -dFirstPage=1 -dLastPage=1 -sOutputFile=/tmp/out%d -f/home/aorth/Desktop/Info\ Note\ Mainstreaming\ gender\ and\ social\ differentiation\ into\ CCAFS\ research\ activities\ in\ West\ Africa-converted.pdf
    -
    +
  • - +
  • So it seems to be something about the PDFs themselves, perhaps related to alpha support?

  • + +
  • The first item (10568/98394) has the following information:

    $ identify Info\ Note\ Mainstreaming\ gender\ and\ social\ differentiation\ into\ CCAFS\ research\ activities\ in\ West\ Africa-converted.pdf\[0\]
     Info Note Mainstreaming gender and social differentiation into CCAFS research activities in West Africa-converted.pdf[0]=>Info Note Mainstreaming gender and social differentiation into CCAFS research activities in West Africa-converted.pdf PDF 595x841 595x841+0+0 16-bit sRGB 107443B 0.000u 0:00.000
     identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/1746.
    -
    +
  • - +
  • And wow, I can’t even run ImageMagick’s identify on the first page of the second item (10568/98930):

    $ identify Food\ safety\ Kenya\ fruits.pdf\[0\]
     zsh: abort (core dumped)  identify Food\ safety\ Kenya\ fruits.pdf\[0\]
    -
    +
  • - +
  • But with GraphicsMagick’s identify it works:

    $ gm identify Food\ safety\ Kenya\ fruits.pdf\[0\]
     DEBUG: FC_WEIGHT didn't match
     Food safety Kenya fruits.pdf PDF 612x792+0+0 DirectClass 8-bit 1.4Mi 0.000u 0m:0.000002s
    -
    +
  • - +
  • Interesting that ImageMagick’s identify does work if you do not specify a page, perhaps as alluded to in the recent Ghostscript bug report:

    $ identify Food\ safety\ Kenya\ fruits.pdf
     Food safety Kenya fruits.pdf[0] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010u 0:00.009
    @@ -251,69 +241,60 @@ Food safety Kenya fruits.pdf[2] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010
     Food safety Kenya fruits.pdf[3] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010u 0:00.009
     Food safety Kenya fruits.pdf[4] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010u 0:00.009
     identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/1746.
    -
    +
  • - +
  • As I expected, ImageMagick cannot generate a thumbnail, but GraphicsMagick can (though it looks like crap):

    $ convert Food\ safety\ Kenya\ fruits.pdf\[0\] -thumbnail 600x600 -flatten Food\ safety\ Kenya\ fruits.pdf.jpg
     zsh: abort (core dumped)  convert Food\ safety\ Kenya\ fruits.pdf\[0\] -thumbnail 600x600 -flatten
     $ gm convert Food\ safety\ Kenya\ fruits.pdf\[0\] -thumbnail 600x600 -flatten Food\ safety\ Kenya\ fruits.pdf.jpg
     DEBUG: FC_WEIGHT didn't match
    -
    +
  • - +
  • I inspected the troublesome PDF using jhove and noticed that it is using ISO PDF/A-1, Level B and the other one doesn’t list a profile, though I don’t think this is relevant

  • + +
  • I found another item that fails when generating a thumbnail (10568/98391); DSpace complains:

    org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d" "-f/tmp/magick-14296Q0rJjfCeIj3w" "-f/tmp/magick-14296k_K6MWqwvpDm"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
     org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d" "-f/tmp/magick-14296Q0rJjfCeIj3w" "-f/tmp/magick-14296k_K6MWqwvpDm"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
    -        at org.im4java.core.Info.getBaseInfo(Info.java:360)
    -        at org.im4java.core.Info.<init>(Info.java:151)
    -        at org.dspace.app.mediafilter.ImageMagickThumbnailFilter.getImageFile(ImageMagickThumbnailFilter.java:142)
    -        at org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter.getDestinationStream(ImageMagickPdfThumbnailFilter.java:24)
    -        at org.dspace.app.mediafilter.FormatFilter.processBitstream(FormatFilter.java:170)
    -        at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:475)
    -        at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:429)
    -        at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:401)
    -        at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:237)
    -        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    -        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    -        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    -        at java.lang.reflect.Method.invoke(Method.java:498)
    -        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
    -        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
    +    at org.im4java.core.Info.getBaseInfo(Info.java:360)
    +    at org.im4java.core.Info.<init>(Info.java:151)
    +    at org.dspace.app.mediafilter.ImageMagickThumbnailFilter.getImageFile(ImageMagickThumbnailFilter.java:142)
    +    at org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter.getDestinationStream(ImageMagickPdfThumbnailFilter.java:24)
    +    at org.dspace.app.mediafilter.FormatFilter.processBitstream(FormatFilter.java:170)
    +    at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:475)
    +    at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:429)
    +    at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:401)
    +    at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:237)
    +    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    +    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    +    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    +    at java.lang.reflect.Method.invoke(Method.java:498)
    +    at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
    +    at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
     Caused by: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d" "-f/tmp/magick-14296Q0rJjfCeIj3w" "-f/tmp/magick-14296k_K6MWqwvpDm"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
    -        at org.im4java.core.ImageCommand.run(ImageCommand.java:219)
    -        at org.im4java.core.Info.getBaseInfo(Info.java:342)
    -        ... 14 more
    +    at org.im4java.core.ImageCommand.run(ImageCommand.java:219)
    +    at org.im4java.core.Info.getBaseInfo(Info.java:342)
    +    ... 14 more
     Caused by: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d" "-f/tmp/magick-14296Q0rJjfCeIj3w" "-f/tmp/magick-14296k_K6MWqwvpDm"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
    -        at org.im4java.core.ImageCommand.finished(ImageCommand.java:253)
    -        at org.im4java.process.ProcessStarter.run(ProcessStarter.java:314)
    -        at org.im4java.core.ImageCommand.run(ImageCommand.java:215)
    -        ... 15 more
    -
    +    at org.im4java.core.ImageCommand.finished(ImageCommand.java:253)
    +    at org.im4java.process.ProcessStarter.run(ProcessStarter.java:314)
    +    at org.im4java.core.ImageCommand.run(ImageCommand.java:215)
    +    ... 15 more
    +
  • - +
  • And on my Arch Linux environment ImageMagick’s convert also segfaults:

    $ convert bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf\[0\] -thumbnail x600 -flatten bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf.jpg
     zsh: abort (core dumped)  convert bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf\[0\]  x60
    -
    +
  • - +
  • But GraphicsMagick’s convert works:

    $ gm convert bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf\[0\] -thumbnail x600 -flatten bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf.jpg
    -
    +
  • - +
  • So far the only thing that stands out is that the two files that don’t work were created with Microsoft Office 2016:

    $ pdfinfo bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf | grep -E '^(Creator|Producer)'
     Creator:        Microsoft® Word 2016
    @@ -321,134 +302,118 @@ Producer:       Microsoft® Word 2016
     $ pdfinfo Food\ safety\ Kenya\ fruits.pdf | grep -E '^(Creator|Producer)'
     Creator:        Microsoft® Word 2016
     Producer:       Microsoft® Word 2016
    -
    +
  • - +
  • And the one that works was created with Office 365:

    $ pdfinfo Info\ Note\ Mainstreaming\ gender\ and\ social\ differentiation\ into\ CCAFS\ research\ activities\ in\ West\ Africa-converted.pdf | grep -E '^(Creator|Producer)'
     Creator:        Microsoft® Word for Office 365
     Producer:       Microsoft® Word for Office 365
    -
    +
  • - +
  • I remembered an old technique I was using to generate thumbnails in 2015 using Inkscape followed by ImageMagick or GraphicsMagick:

    $ inkscape Food\ safety\ Kenya\ fruits.pdf -z --export-dpi=72 --export-area-drawing --export-png='cover.png'
     $ gm convert -resize x600 -flatten -quality 85 cover.png cover.jpg
    -
    +
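
  • If I ever need to batch that, a small Python wrapper around the same two commands might look like this (paths are examples; it assumes inkscape and gm are on the PATH):

    import subprocess  # sketch of the Inkscape + GraphicsMagick fallback above

    pdf = 'Food safety Kenya fruits.pdf'

    subprocess.run(['inkscape', pdf, '-z', '--export-dpi=72',
                    '--export-area-drawing', '--export-png=cover.png'], check=True)
    subprocess.run(['gm', 'convert', '-resize', 'x600', '-flatten',
                    '-quality', '85', 'cover.png', 'cover.jpg'], check=True)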
  • - + +
  • Related messages from dspace.log:

    2018-12-03 15:44:00,030 WARN  org.dspace.core.ConfigurationManager @ Requested configuration module: atmire-datatables not found
     2018-12-03 15:44:03,390 ERROR com.atmire.app.webui.servlet.ExportServlet @ Error converter plugin not found: interface org.infoCon.ConverterPlugin
     ...
     2018-12-03 15:45:01,667 WARN  org.dspace.core.ConfigurationManager @ Requested configuration module: atmire-listing-and-reports not found
    -
    +
  • + - +
  • I tested it on my local environment with Tomcat 8.5.34 and the JSPUI application still has an error (again, the logs show something about tag cloud, so it may be unrelated), and the Listings and Reports module still asks you to log in again, despite already being logged in via XMLUI, but it does appear to work (I generated a report and exported a PDF)

  • + +
  • I think the errors about missing Atmire components must be important, here on my local machine as well (though not the one about atmire-listings-and-reports):

    2018-12-03 16:44:00,009 WARN  org.dspace.core.ConfigurationManager @ Requested configuration module: atmire-datatables not found
    -
    +
  • -

    2018-12-04

    +
  • Last night Linode sent a message that the load on CGSpace (linode18) was too high, here’s a list of the top users at the time and throughout the day:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Dec/2018:1(5|6|7|8)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -    225 40.77.167.142
    -    226 66.249.64.63
    -    232 46.101.86.248
    -    285 45.5.186.2
    -    333 54.70.40.11
    -    411 193.29.13.85
    -    476 34.218.226.147
    -    962 66.249.70.27
    -   1193 35.237.175.180
    -   1450 2a01:4f8:140:3192::2
    +225 40.77.167.142
    +226 66.249.64.63
    +232 46.101.86.248
    +285 45.5.186.2
    +333 54.70.40.11
    +411 193.29.13.85
    +476 34.218.226.147
    +962 66.249.70.27
    +1193 35.237.175.180
    +1450 2a01:4f8:140:3192::2
     # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Dec/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -   1141 207.46.13.57
    -   1299 197.210.168.174
    -   1341 54.70.40.11
    -   1429 40.77.167.142
    -   1528 34.218.226.147
    -   1973 66.249.70.27
    -   2079 50.116.102.77
    -   2494 78.46.79.71
    -   3210 2a01:4f8:140:3192::2
    -   4190 35.237.175.180
    -
    +1141 207.46.13.57
    +1299 197.210.168.174
    +1341 54.70.40.11
    +1429 40.77.167.142
    +1528 34.218.226.147
    +1973 66.249.70.27
    +2079 50.116.102.77
    +2494 78.46.79.71
    +3210 2a01:4f8:140:3192::2
    +4190 35.237.175.180
    +
  • - +
  • 35.237.175.180 is known to us (CCAFS?), and I’ve already added it to the list of bot IPs in nginx, which appears to be working:

    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-12-03
     4772
     $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-12-03 | sort | uniq | wc -l
     630
    -
    +
  • - +
  • I haven’t seen 2a01:4f8:140:3192::2 before. Its user agent is some new bot:

    Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
    -
    +
  • - +
  • At least it seems the Tomcat Crawler Session Manager Valve is working to re-use the common bot XMLUI sessions:

    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2' dspace.log.2018-12-03
     5111
     $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2' dspace.log.2018-12-03 | sort | uniq | wc -l
     419
    -
    +
  • - +
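
  • A quick way to run the same check for several suspicious IPs at once is a few lines of Python over dspace.log (a rough sketch; it just re-implements the grep/sort/uniq one-liners above and assumes the same session_id=...:ip_addr=... log format):

    import re

    # Count matching log entries vs unique Tomcat session IDs per address,
    # mirroring the grep -c / grep -o | sort | uniq checks above
    ips = ["35.237.175.180", "2a01:4f8:140:3192::2", "78.46.79.71"]

    with open("dspace.log.2018-12-03") as f:
        log = f.read()

    for ip in ips:
        session_ids = re.findall(r"session_id=([A-Z0-9]{32}):ip_addr=" + re.escape(ip), log)
        print(f"{ip}: {len(session_ids)} hits, {len(set(session_ids))} unique sessions")
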
  • 78.46.79.71 is another host on Hetzner with the following user agent:

    Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
    -
    +
  • - +
  • This is not the first time a host on Hetzner has used a “normal” user agent to make thousands of requests

  • + +
  • At least it is re-using its Tomcat sessions somehow:

    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03
     2044
     $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03 | sort | uniq | wc -l
     1
    -
    +
  • -

    PostgreSQL connections day

    @@ -463,43 +428,40 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03 + +
  • I looked in the logs and there’s nothing particular going on:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "05/Dec/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -   1225 157.55.39.177
    -   1240 207.46.13.12
    -   1261 207.46.13.101
    -   1411 207.46.13.157
    -   1529 34.218.226.147
    -   2085 50.116.102.77
    -   3334 2a01:7e00::f03c:91ff:fe0a:d645
    -   3733 66.249.70.27
    -   3815 35.237.175.180
    -   7669 54.70.40.11
    -
    +1225 157.55.39.177
    +1240 207.46.13.12
    +1261 207.46.13.101
    +1411 207.46.13.157
    +1529 34.218.226.147
    +2085 50.116.102.77
    +3334 2a01:7e00::f03c:91ff:fe0a:d645
    +3733 66.249.70.27
    +3815 35.237.175.180
    +7669 54.70.40.11
    +
  • - +
  • 54.70.40.11 is some new bot with the following user agent:

    Mozilla/5.0 (compatible) SemanticScholarBot (+https://www.semanticscholar.org/crawler)
    -
    +
  • - +
  • But Tomcat is forcing them to re-use their Tomcat sessions with the Crawler Session Manager valve:

    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11' dspace.log.2018-12-05
     6980
     $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11' dspace.log.2018-12-05 | sort | uniq | wc -l
     1156
    -
    +
  • -

    2018-12-10

    @@ -541,32 +503,30 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11' dspace.log.2018-12-05 + +
  • Looking at the nginx logs I see a few new IPs in the top 10:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "17/Dec/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -    927 157.55.39.81
    -    975 54.70.40.11
    -   2090 50.116.102.77
    -   2121 66.249.66.219
    -   3811 35.237.175.180
    -   4590 205.186.128.185
    -   4590 70.32.83.92
    -   5436 2a01:4f8:173:1e85::2
    -   5438 143.233.227.216
    -   6706 94.71.244.172
    -
    +927 157.55.39.81
    +975 54.70.40.11
    +2090 50.116.102.77
    +2121 66.249.66.219
    +3811 35.237.175.180
    +4590 205.186.128.185
    +4590 70.32.83.92
    +5436 2a01:4f8:173:1e85::2
    +5438 143.233.227.216
    +6706 94.71.244.172
    +
  • - +
  • 94.71.244.172 and 143.233.227.216 are both in Greece and use the following user agent:

    Mozilla/3.0 (compatible; Indy Library)
    -
    +
  • -

    2018-12-18

    @@ -584,8 +544,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11' dspace.log.2018-12-05

    2018-12-20

    +
  • Testing compression of PostgreSQL backups with xz and gzip:

    $ time xz -c cgspace_2018-12-19.backup > cgspace_2018-12-19.backup.xz
     xz -c cgspace_2018-12-19.backup > cgspace_2018-12-19.backup.xz  48.29s user 0.19s system 99% cpu 48.579 total
    @@ -595,43 +554,40 @@ $ ls -lh cgspace_2018-12-19.backup*
     -rw-r--r-- 1 aorth aorth 96M Dec 19 02:15 cgspace_2018-12-19.backup
     -rw-r--r-- 1 aorth aorth 94M Dec 20 11:36 cgspace_2018-12-19.backup.gz
     -rw-r--r-- 1 aorth aorth 93M Dec 20 11:35 cgspace_2018-12-19.backup.xz
    -
    +
  • - +
  • Looks like it’s really not worth it…

  • + +
  • Peter pointed out that Discovery filters for CTA subjects on item pages were not working

  • + +
  • It looks like there were some mismatches in the Discovery index names and the XMLUI configuration, so I fixed them (#406)

  • + +
  • Peter asked if we could create a controlled vocabulary for publishers (dc.publisher)

  • + +
  • I see we have about 3500 distinct publishers:

    # SELECT COUNT(DISTINCT(text_value)) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=39;
    - count
    +count
     -------
    -  3522
    +3522
     (1 row)
    -
    +
  • - +
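
  • A starting point for that controlled vocabulary could be dumping the distinct dc.publisher values to CSV, something like this Python sketch (assumes psycopg2 and the same metadata_field_id 39 as the query above; the connection parameters are placeholders):

    import csv

    import psycopg2

    # Dump distinct dc.publisher values (metadata_field_id 39) to a CSV that could
    # seed a controlled vocabulary; adjust the connection string for the real database
    conn = psycopg2.connect("dbname=dspace user=dspace password=fuu host=localhost")

    with conn.cursor() as cursor:
        cursor.execute(
            "SELECT DISTINCT text_value FROM metadatavalue "
            "WHERE resource_type_id=2 AND metadata_field_id=39 ORDER BY text_value"
        )
        rows = cursor.fetchall()

    with open("/tmp/publishers.csv", "w") as f:
        writer = csv.writer(f)
        writer.writerow(["dc.publisher"])
        writer.writerows(rows)

    conn.close()
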
  • I reverted the metadata changes related to “Unrestricted Access” and “Restricted Access” on DSpace Test because we’re not pushing forward with the new status terms for now

  • + +
  • Purge remaining Oracle Java 8 stuff from CGSpace (linode18) since we migrated to OpenJDK a few months ago:

    # dpkg -P oracle-java8-installer oracle-java8-set-default
    -
    +
  • - +
  • Update usage rights on CGSpace as we agreed with Maria Garruccio and Peter last month:

    $ ./fix-metadata-values.py -i /tmp/2018-11-27-update-rights.csv -f dc.rights -t correct -m 53 -db dspace -u dspace -p 'fuu' -d
     Connected to database.
     Fixed 466 occurences of: Copyrighted; Any re-use allowed
    -
    +
  • - +
  • Upgrade PostgreSQL on CGSpace (linode18) from 9.5 to 9.6:

    # apt install postgresql-9.6 postgresql-client-9.6 postgresql-contrib-9.6 postgresql-server-dev-9.6
     # pg_ctlcluster 9.5 main stop
    @@ -642,74 +598,69 @@ Fixed 466 occurences of: Copyrighted; Any re-use allowed
     # pg_upgradecluster 9.5 main
     # pg_dropcluster 9.5 main
     # dpkg -l | grep postgresql | grep 9.5 | awk '{print $2}' | xargs dpkg -r
    -
    +
  • - +
  • I’ve been running PostgreSQL 9.6 for months on my local development and public DSpace Test (linode19) environments

  • + +
  • Run all system updates on CGSpace (linode18) and restart the server

  • + +
  • Try to run the DSpace cleanup script on CGSpace (linode18), but I get some errors about foreign key constraints:

    $ dspace cleanup -v
    - - Deleting bitstream information (ID: 158227)
    - - Deleting bitstream record from database (ID: 158227)
    +- Deleting bitstream information (ID: 158227)
    +- Deleting bitstream record from database (ID: 158227)
     Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
    -  Detail: Key (bitstream_id)=(158227) is still referenced from table "bundle".
    +Detail: Key (bitstream_id)=(158227) is still referenced from table "bundle".
     ...
    -
    +
  • - +
  • As always, the solution is to delete those IDs manually in PostgreSQL:

    $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (158227, 158251);'
     UPDATE 1
    -
    +
  • -

    2018-12-29

    + +
  • The top IP addresses as of this evening are:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "29/Dec/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -    963 40.77.167.152
    -    987 35.237.175.180
    -   1062 40.77.167.55
    -   1464 66.249.66.223
    -   1660 34.218.226.147
    -   1801 70.32.83.92
    -   2005 50.116.102.77
    -   3218 66.249.66.219
    -   4608 205.186.128.185
    -   5585 54.70.40.11
    -
    +963 40.77.167.152
    +987 35.237.175.180
    +1062 40.77.167.55
    +1464 66.249.66.223
    +1660 34.218.226.147
    +1801 70.32.83.92
    +2005 50.116.102.77
    +3218 66.249.66.219
    +4608 205.186.128.185
    +5585 54.70.40.11
    +
  • - +
  • And just around the time of the alert:

    # zcat --force /var/log/nginx/*.log.1 /var/log/nginx/*.log.2.gz | grep -E "29/Dec/2018:1(6|7|8)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -    115 66.249.66.223
    -    118 207.46.13.14
    -    123 34.218.226.147
    -    133 95.108.181.88
    -    137 35.237.175.180
    -    164 66.249.66.219
    -    260 157.55.39.59
    -    291 40.77.167.55
    -    312 207.46.13.129
    -   1253 54.70.40.11
    -
    +115 66.249.66.223
    +118 207.46.13.14
    +123 34.218.226.147
    +133 95.108.181.88
    +137 35.237.175.180
    +164 66.249.66.219
    +260 157.55.39.59
    +291 40.77.167.55
    +312 207.46.13.129
    +1253 54.70.40.11
    +
  • -

diff --git a/docs/2019-01/index.html b/docs/2019-01/index.html
index c03baa9cf..fef463801 100644
--- a/docs/2019-01/index.html
+++ b/docs/2019-01/index.html
@@ -134,41 +136,40 @@
  • I don’t see anything interesting in the web server logs around that time though:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -     92 40.77.167.4
    -     99 210.7.29.100
    -    120 38.126.157.45
    -    177 35.237.175.180
    -    177 40.77.167.32
    -    216 66.249.75.219
    -    225 18.203.76.93
    -    261 46.101.86.248
    -    357 207.46.13.1
    -    903 54.70.40.11
    -
    + 92 40.77.167.4
    + 99 210.7.29.100
    +120 38.126.157.45
    +177 35.237.175.180
    +177 40.77.167.32
    +216 66.249.75.219
    +225 18.203.76.93
    +261 46.101.86.248
    +357 207.46.13.1
    +903 54.70.40.11
    +
  • + +
  • Analyzing the types of requests made by the top few IPs during that time:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | grep 54.70.40.11 | grep -o -E "(bitstream|discover|handle)" | sort | uniq -c
    -     30 bitstream
    -    534 discover
    -    352 handle
    + 30 bitstream
    +534 discover
    +352 handle
     # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | grep 207.46.13.1 | grep -o -E "(bitstream|discover|handle)" | sort | uniq -c
    -    194 bitstream
    -    345 handle
    +194 bitstream
    +345 handle
     # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | grep 46.101.86.248 | grep -o -E "(bitstream|discover|handle)" | sort | uniq -c
    -    261 handle
    -
    +261 handle
    +
  • - +
  • It’s not clear to me what was causing the outbound traffic spike

  • + +
  • Oh nice! The once-per-year cron job for rotating the Solr statistics actually worked now (for the first time ever!):

    Moving: 81742 into core statistics-2010
     Moving: 1837285 into core statistics-2011
    @@ -179,33 +180,31 @@ Moving: 2941736 into core statistics-2015
     Moving: 5926070 into core statistics-2016
     Moving: 10562554 into core statistics-2017
     Moving: 18497180 into core statistics-2018
    -
    +
  • -

    2019-01-03

    +
  • Update local Docker image for DSpace PostgreSQL, re-using the existing data volume:

    $ sudo docker pull postgres:9.6-alpine
     $ sudo docker rm dspacedb
     $ sudo docker run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
    -
    +
  • - + +
  • The JSPUI application—which Listings and Reports depends upon—also does not load, though the error is perhaps unrelated:

    2019-01-03 14:45:21,727 INFO  org.dspace.browse.BrowseEngine @ anonymous:session_id=9471D72242DAA05BCC87734FE3C66EA6:ip_addr=127.0.0.1:browse_mini:
     2019-01-03 14:45:21,971 INFO  org.dspace.app.webui.discovery.DiscoverUtility @ facets for scope, null: 23
    @@ -214,107 +213,106 @@ $ sudo docker run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/d
     -- Parameters were:
     
     org.apache.jasper.JasperException: /home.jsp (line: [214], column: [1]) /discovery/static-tagcloud-facet.jsp (line: [57], column: [8]) No tag [tagcloud] defined in tag library imported with prefix [dspace]
    -    at org.apache.jasper.compiler.DefaultErrorHandler.jspError(DefaultErrorHandler.java:41)
    -    at org.apache.jasper.compiler.ErrorDispatcher.dispatch(ErrorDispatcher.java:291)
    -    at org.apache.jasper.compiler.ErrorDispatcher.jspError(ErrorDispatcher.java:97)
    -    at org.apache.jasper.compiler.Parser.processIncludeDirective(Parser.java:347)
    -    at org.apache.jasper.compiler.Parser.parseIncludeDirective(Parser.java:380)
    -    at org.apache.jasper.compiler.Parser.parseDirective(Parser.java:481)
    -    at org.apache.jasper.compiler.Parser.parseElements(Parser.java:1445)
    -    at org.apache.jasper.compiler.Parser.parseBody(Parser.java:1683)
    -    at org.apache.jasper.compiler.Parser.parseOptionalBody(Parser.java:1016)
    -    at org.apache.jasper.compiler.Parser.parseCustomTag(Parser.java:1291)
    -    at org.apache.jasper.compiler.Parser.parseElements(Parser.java:1470)
    -    at org.apache.jasper.compiler.Parser.parse(Parser.java:144)
    -    at org.apache.jasper.compiler.ParserController.doParse(ParserController.java:244)
    -    at org.apache.jasper.compiler.ParserController.parse(ParserController.java:105)
    -    at org.apache.jasper.compiler.Compiler.generateJava(Compiler.java:202)
    -    at org.apache.jasper.compiler.Compiler.compile(Compiler.java:373)
    -    at org.apache.jasper.compiler.Compiler.compile(Compiler.java:350)
    -    at org.apache.jasper.compiler.Compiler.compile(Compiler.java:334)
    -    at org.apache.jasper.JspCompilationContext.compile(JspCompilationContext.java:595)
    -    at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:399)
    -    at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:386)
    -    at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:330)
    -    at javax.servlet.http.HttpServlet.service(HttpServlet.java:742)
    -    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231)
    -    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    -    at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
    -    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    -    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    -    at org.apache.catalina.core.ApplicationDispatcher.invoke(ApplicationDispatcher.java:728)
    -    at org.apache.catalina.core.ApplicationDispatcher.processRequest(ApplicationDispatcher.java:470)
    -    at org.apache.catalina.core.ApplicationDispatcher.doForward(ApplicationDispatcher.java:395)
    -    at org.apache.catalina.core.ApplicationDispatcher.forward(ApplicationDispatcher.java:316)
    -    at org.dspace.app.webui.util.JSPManager.showJSP(JSPManager.java:60)
    -    at org.apache.jsp.index_jsp._jspService(index_jsp.java:191)
    -    at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
    -    at javax.servlet.http.HttpServlet.service(HttpServlet.java:742)
    -    at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:476)
    -    at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:386)
    -    at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:330)
    -    at javax.servlet.http.HttpServlet.service(HttpServlet.java:742)
    -    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231)
    -    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    -    at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
    -    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    -    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    -    at org.dspace.utils.servlet.DSpaceWebappServletFilter.doFilter(DSpaceWebappServletFilter.java:78)
    -    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    -    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    -    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:198)
    -    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:96)
    -    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:493)
    -    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:140)
    -    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:81)
    -    at org.apache.catalina.valves.CrawlerSessionManagerValve.invoke(CrawlerSessionManagerValve.java:234)
    -    at org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:650)
    -    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:87)
    -    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:342)
    -    at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:800)
    -    at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66)
    -    at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:806)
    -    at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1498)
    -    at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
    -    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    -    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    -    at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    -    at java.lang.Thread.run(Thread.java:748)
    -
    +at org.apache.jasper.compiler.DefaultErrorHandler.jspError(DefaultErrorHandler.java:41)
    +at org.apache.jasper.compiler.ErrorDispatcher.dispatch(ErrorDispatcher.java:291)
    +at org.apache.jasper.compiler.ErrorDispatcher.jspError(ErrorDispatcher.java:97)
    +at org.apache.jasper.compiler.Parser.processIncludeDirective(Parser.java:347)
    +at org.apache.jasper.compiler.Parser.parseIncludeDirective(Parser.java:380)
    +at org.apache.jasper.compiler.Parser.parseDirective(Parser.java:481)
    +at org.apache.jasper.compiler.Parser.parseElements(Parser.java:1445)
    +at org.apache.jasper.compiler.Parser.parseBody(Parser.java:1683)
    +at org.apache.jasper.compiler.Parser.parseOptionalBody(Parser.java:1016)
    +at org.apache.jasper.compiler.Parser.parseCustomTag(Parser.java:1291)
    +at org.apache.jasper.compiler.Parser.parseElements(Parser.java:1470)
    +at org.apache.jasper.compiler.Parser.parse(Parser.java:144)
    +at org.apache.jasper.compiler.ParserController.doParse(ParserController.java:244)
    +at org.apache.jasper.compiler.ParserController.parse(ParserController.java:105)
    +at org.apache.jasper.compiler.Compiler.generateJava(Compiler.java:202)
    +at org.apache.jasper.compiler.Compiler.compile(Compiler.java:373)
    +at org.apache.jasper.compiler.Compiler.compile(Compiler.java:350)
    +at org.apache.jasper.compiler.Compiler.compile(Compiler.java:334)
    +at org.apache.jasper.JspCompilationContext.compile(JspCompilationContext.java:595)
    +at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:399)
    +at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:386)
    +at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:330)
    +at javax.servlet.http.HttpServlet.service(HttpServlet.java:742)
    +at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231)
    +at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    +at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
    +at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    +at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    +at org.apache.catalina.core.ApplicationDispatcher.invoke(ApplicationDispatcher.java:728)
    +at org.apache.catalina.core.ApplicationDispatcher.processRequest(ApplicationDispatcher.java:470)
    +at org.apache.catalina.core.ApplicationDispatcher.doForward(ApplicationDispatcher.java:395)
    +at org.apache.catalina.core.ApplicationDispatcher.forward(ApplicationDispatcher.java:316)
    +at org.dspace.app.webui.util.JSPManager.showJSP(JSPManager.java:60)
    +at org.apache.jsp.index_jsp._jspService(index_jsp.java:191)
    +at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
    +at javax.servlet.http.HttpServlet.service(HttpServlet.java:742)
    +at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:476)
    +at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:386)
    +at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:330)
    +at javax.servlet.http.HttpServlet.service(HttpServlet.java:742)
    +at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231)
    +at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    +at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
    +at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    +at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    +at org.dspace.utils.servlet.DSpaceWebappServletFilter.doFilter(DSpaceWebappServletFilter.java:78)
    +at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    +at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    +at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:198)
    +at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:96)
    +at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:493)
    +at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:140)
    +at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:81)
    +at org.apache.catalina.valves.CrawlerSessionManagerValve.invoke(CrawlerSessionManagerValve.java:234)
    +at org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:650)
    +at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:87)
    +at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:342)
    +at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:800)
    +at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66)
    +at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:806)
    +at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1498)
    +at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
    +at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    +at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    +at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    +at java.lang.Thread.run(Thread.java:748)
    +
  • -

    2019-01-04

    +
  • Linode sent a message last night that CGSpace (linode18) had high CPU usage, but I don’t see anything around that time in the web server logs:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Jan/2019:1(7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -    189 207.46.13.192
    -    217 31.6.77.23
    -    340 66.249.70.29
    -    349 40.77.167.86
    -    417 34.218.226.147
    -    630 207.46.13.173
    -    710 35.237.175.180
    -    790 40.77.167.87
    -   1776 66.249.70.27
    -   2099 54.70.40.11
    -
    +189 207.46.13.192
    +217 31.6.77.23
    +340 66.249.70.29
    +349 40.77.167.86
    +417 34.218.226.147
    +630 207.46.13.173
    +710 35.237.175.180
    +790 40.77.167.87
    +1776 66.249.70.27
    +2099 54.70.40.11
    +
  • - +
  • I’m thinking about trying to validate our dc.subject terms against AGROVOC webservices

  • + +
  • There seem to be a few APIs and the documentation is kinda confusing, but I found this REST endpoint that does work well, for example searching for SOIL:

    $ http http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=SOIL&lang=en
     HTTP/1.1 200 OK
    @@ -331,40 +329,39 @@ X-Content-Type-Options: nosniff
     X-Frame-Options: ALLOW-FROM http://aims.fao.org
     
     {
    -    "@context": {
    -        "@language": "en",
    -        "altLabel": "skos:altLabel",
    -        "hiddenLabel": "skos:hiddenLabel",
    -        "isothes": "http://purl.org/iso25964/skos-thes#",
    -        "onki": "http://schema.onki.fi/onki#",
    -        "prefLabel": "skos:prefLabel",
    -        "results": {
    -            "@container": "@list",
    -            "@id": "onki:results"
    -        },
    -        "skos": "http://www.w3.org/2004/02/skos/core#",
    -        "type": "@type",
    -        "uri": "@id"
    +"@context": {
    +    "@language": "en",
    +    "altLabel": "skos:altLabel",
    +    "hiddenLabel": "skos:hiddenLabel",
    +    "isothes": "http://purl.org/iso25964/skos-thes#",
    +    "onki": "http://schema.onki.fi/onki#",
    +    "prefLabel": "skos:prefLabel",
    +    "results": {
    +        "@container": "@list",
    +        "@id": "onki:results"
         },
    -    "results": [
    -        {
    -            "lang": "en",
    -            "prefLabel": "soil",
    -            "type": [
    -                "skos:Concept"
    -            ],
    -            "uri": "http://aims.fao.org/aos/agrovoc/c_7156",
    -            "vocab": "agrovoc"
    -        }
    -    ],
    -    "uri": ""
    +    "skos": "http://www.w3.org/2004/02/skos/core#",
    +    "type": "@type",
    +    "uri": "@id"
    +},
    +"results": [
    +    {
    +        "lang": "en",
    +        "prefLabel": "soil",
    +        "type": [
    +            "skos:Concept"
    +        ],
    +        "uri": "http://aims.fao.org/aos/agrovoc/c_7156",
    +        "vocab": "agrovoc"
    +    }
    +],
    +"uri": ""
     }
    -
    +
  • - +
  • The API does not appear to be case sensitive (searches for SOIL and soil return the same thing)

  • + +
  • I’m a bit confused that there’s no obvious return code or status when a term is not found, for example SOILS:

    HTTP/1.1 200 OK
     Access-Control-Allow-Origin: *
    @@ -380,30 +377,29 @@ X-Content-Type-Options: nosniff
     X-Frame-Options: ALLOW-FROM http://aims.fao.org
     
     {
    -    "@context": {
    -        "@language": "en",
    -        "altLabel": "skos:altLabel",
    -        "hiddenLabel": "skos:hiddenLabel",
    -        "isothes": "http://purl.org/iso25964/skos-thes#",
    -        "onki": "http://schema.onki.fi/onki#",
    -        "prefLabel": "skos:prefLabel",
    -        "results": {
    -            "@container": "@list",
    -            "@id": "onki:results"
    -        },
    -        "skos": "http://www.w3.org/2004/02/skos/core#",
    -        "type": "@type",
    -        "uri": "@id"
    +"@context": {
    +    "@language": "en",
    +    "altLabel": "skos:altLabel",
    +    "hiddenLabel": "skos:hiddenLabel",
    +    "isothes": "http://purl.org/iso25964/skos-thes#",
    +    "onki": "http://schema.onki.fi/onki#",
    +    "prefLabel": "skos:prefLabel",
    +    "results": {
    +        "@container": "@list",
    +        "@id": "onki:results"
         },
    -    "results": [],
    -    "uri": ""
    +    "skos": "http://www.w3.org/2004/02/skos/core#",
    +    "type": "@type",
    +    "uri": "@id"
    +},
    +"results": [],
    +"uri": ""
     }
    -
    +
  • - +
  • I guess the results object will just be empty…
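
  • So a simple validation script could just treat an empty results list as "not found", something like this sketch with requests (the endpoint and parameters are the same ones used above; the helper name is only for illustration):

    import requests

    def agrovoc_has_term(term, lang="en"):
        """Return True if AGROVOC knows the term, based on the REST search above."""
        url = "http://agrovoc.uniroma2.it/agrovoc/rest/v1/search"
        response = requests.get(url, params={"query": term, "lang": lang})
        response.raise_for_status()
        # A missing term still returns HTTP 200, so only the results list matters
        return len(response.json()["results"]) > 0

    print(agrovoc_has_term("SOIL"))   # True
    print(agrovoc_has_term("SOILS"))  # False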

  • + +
  • Another way would be to try with SPARQL, perhaps using the Python 2.7 sparql-client:

    $ python2.7 -m virtualenv /tmp/sparql
     $ . /tmp/sparql/bin/activate
    @@ -412,16 +408,16 @@ $ ipython
     In [10]: import sparql
     In [11]: s = sparql.Service("http://agrovoc.uniroma2.it:3030/agrovoc/sparql", "utf-8", "GET")
     In [12]: statement=('PREFIX skos: <http://www.w3.org/2004/02/skos/core#> '
    -    ...: 'SELECT '
    -    ...: '?label '
    -    ...: 'WHERE { '
    -    ...: '{  ?concept  skos:altLabel ?label . } UNION {  ?concept  skos:prefLabel ?label . } '
    -    ...: 'FILTER regex(str(?label), "^fish", "i") . '
    -    ...: '} LIMIT 10')
    +...: 'SELECT '
    +...: '?label '
    +...: 'WHERE { '
    +...: '{  ?concept  skos:altLabel ?label . } UNION {  ?concept  skos:prefLabel ?label . } '
    +...: 'FILTER regex(str(?label), "^fish", "i") . '
    +...: '} LIMIT 10')
     In [13]: result = s.query(statement)
     In [14]: for row in result.fetchone():
    -   ...:     print(row)
    -   ...:
    +...:     print(row)
    +...:
     (<Literal "fish catching"@en>,)
     (<Literal "fish harvesting"@en>,)
     (<Literal "fish meat"@en>,)
    @@ -432,10 +428,9 @@ In [14]: for row in result.fetchone():
     (<Literal "fishflies"@en>,)
     (<Literal "fishery biology"@en>,)
     (<Literal "fish production"@en>,)
    -
    +
  • -

    2019-01-06

    @@ -502,32 +497,31 @@ In [14]: for row in result.fetchone():
  • We agreed to try to stick to pure Dublin Core where possible, then use fields that exist in standard DSpace, and use “cg” namespace for everything else
  • Major changes are to move dc.contributor.author to dc.creator (which MELSpace and WorldFish are already using in their DSpace repositories)
  • I am testing the speed of the WorldFish DSpace repository’s REST API and it’s five to ten times faster than CGSpace as I tested in 2018-10:

    $ time http --print h 'https://digitalarchive.worldfishcenter.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0'
     
     0.16s user 0.03s system 3% cpu 5.185 total
     0.17s user 0.02s system 2% cpu 7.123 total
     0.18s user 0.02s system 6% cpu 3.047 total
    -
    +
  • - +
  • In other news, Linode sent a mail last night that the CPU load on CGSpace (linode18) was high, here are the top IPs in the logs around those few hours:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "14/Jan/2019:(17|18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -    157 31.6.77.23
    -    192 54.70.40.11
    -    202 66.249.64.157
    -    207 40.77.167.204
    -    220 157.55.39.140
    -    326 197.156.105.116
    -    385 207.46.13.158
    -   1211 35.237.175.180
    -   1830 66.249.64.155
    -   2482 45.5.186.2
    -
    +157 31.6.77.23
    +192 54.70.40.11
    +202 66.249.64.157
    +207 40.77.167.204
    +220 157.55.39.140
    +326 197.156.105.116
    +385 207.46.13.158
    +1211 35.237.175.180
    +1830 66.249.64.155
    +2482 45.5.186.2
    +
  • +

    2019-01-16

    @@ -638,122 +632,119 @@ In [14]: for row in result.fetchone():

    Solr stats fucked up

    +
  • In the Solr admin UI I see the following error:

    statistics-2018: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
    -
    +
  • - +
  • Looking in the Solr log I see this:

    2019-01-16 13:37:55,395 ERROR org.apache.solr.core.CoreContainer @ Error creating core [statistics-2018]: Error opening new searcher
     org.apache.solr.common.SolrException: Error opening new searcher
    -    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:873)
    -    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:646)
    -    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:491)
    -    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:466)
    -    at org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:575)
    -    at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestInternal(CoreAdminHandler.java:199)
    -    at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:188)
    -    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    -    at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:729)
    -    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:258)
    -    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
    -    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
    -    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
    -    at org.dspace.solr.filters.LocalHostRestrictionFilter.doFilter(LocalHostRestrictionFilter.java:50)
    -    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
    -    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
    -    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:221)
    -    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
    -    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:505)
    -    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:169)
    -    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
    -    at org.apache.catalina.valves.CrawlerSessionManagerValve.invoke(CrawlerSessionManagerValve.java:180)
    -    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:956)
    -    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
    -    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:436)
    -    at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1078)
    -    at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:625)
    -    at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:316)
    -    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    -    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    -    at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    -    at java.lang.Thread.run(Thread.java:748)
    +at org.apache.solr.core.SolrCore.<init>(SolrCore.java:873)
    +at org.apache.solr.core.SolrCore.<init>(SolrCore.java:646)
    +at org.apache.solr.core.CoreContainer.create(CoreContainer.java:491)
    +at org.apache.solr.core.CoreContainer.create(CoreContainer.java:466)
    +at org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:575)
    +at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestInternal(CoreAdminHandler.java:199)
    +at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:188)
    +at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    +at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:729)
    +at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:258)
    +at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
    +at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
    +at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
    +at org.dspace.solr.filters.LocalHostRestrictionFilter.doFilter(LocalHostRestrictionFilter.java:50)
    +at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
    +at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
    +at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:221)
    +at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
    +at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:505)
    +at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:169)
    +at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
    +at org.apache.catalina.valves.CrawlerSessionManagerValve.invoke(CrawlerSessionManagerValve.java:180)
    +at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:956)
    +at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
    +at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:436)
    +at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1078)
    +at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:625)
    +at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:316)
    +at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    +at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    +at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    +at java.lang.Thread.run(Thread.java:748)
     Caused by: org.apache.solr.common.SolrException: Error opening new searcher
    -    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1565)
    -    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1677)
    -    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:845)
    -    ... 31 more
    +at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1565)
    +at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1677)
    +at org.apache.solr.core.SolrCore.<init>(SolrCore.java:845)
    +... 31 more
     Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
    -    at org.apache.lucene.store.Lock.obtain(Lock.java:89)
    -    at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:753)
    -    at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:77)
    -    at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:64)
    -    at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:279)
    -    at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:111)
    -    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1528)
    -    ... 33 more
    +at org.apache.lucene.store.Lock.obtain(Lock.java:89)
    +at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:753)
    +at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:77)
    +at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:64)
    +at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:279)
    +at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:111)
    +at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1528)
    +... 33 more
     2019-01-16 13:37:55,401 ERROR org.apache.solr.core.SolrCore @ org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2018': Unable to create core [statistics-2018] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
    -    at org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:613)
    -    at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestInternal(CoreAdminHandler.java:199)
    -    at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:188)
    -    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    -    at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:729)
    -    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:258)
    -    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
    -    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
    -    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
    -    at org.dspace.solr.filters.LocalHostRestrictionFilter.doFilter(LocalHostRestrictionFilter.java:50)
    -    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
    -    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
    -    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:221)
    -    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
    -    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:505)
    -    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:169)
    -    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
    -    at org.apache.catalina.valves.CrawlerSessionManagerValve.invoke(CrawlerSessionManagerValve.java:180)
    -    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:956)
    -    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
    -    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:436)
    -    at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1078)
    -    at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:625)
    -    at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:316)
    -    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    -    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    -    at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    -    at java.lang.Thread.run(Thread.java:748)
    +at org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:613)
    +at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestInternal(CoreAdminHandler.java:199)
    +at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:188)
    +at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    +at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:729)
    +at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:258)
    +at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
    +at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
    +at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
    +at org.dspace.solr.filters.LocalHostRestrictionFilter.doFilter(LocalHostRestrictionFilter.java:50)
    +at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
    +at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
    +at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:221)
    +at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
    +at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:505)
    +at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:169)
    +at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
    +at org.apache.catalina.valves.CrawlerSessionManagerValve.invoke(CrawlerSessionManagerValve.java:180)
    +at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:956)
    +at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
    +at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:436)
    +at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1078)
    +at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:625)
    +at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:316)
    +at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    +at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    +at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    +at java.lang.Thread.run(Thread.java:748)
     Caused by: org.apache.solr.common.SolrException: Unable to create core [statistics-2018]
    -    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:507)
    -    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:466)
    -    at org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:575)
    -    ... 27 more
    +at org.apache.solr.core.CoreContainer.create(CoreContainer.java:507)
    +at org.apache.solr.core.CoreContainer.create(CoreContainer.java:466)
    +at org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:575)
    +... 27 more
     Caused by: org.apache.solr.common.SolrException: Error opening new searcher
    -    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:873)
    -    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:646)
    -    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:491)
    -    ... 29 more
    +at org.apache.solr.core.SolrCore.<init>(SolrCore.java:873)
    +at org.apache.solr.core.SolrCore.<init>(SolrCore.java:646)
    +at org.apache.solr.core.CoreContainer.create(CoreContainer.java:491)
    +... 29 more
     Caused by: org.apache.solr.common.SolrException: Error opening new searcher
    -    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1565)
    -    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1677)
    -    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:845)
    -    ... 31 more
    +at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1565)
    +at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1677)
    +at org.apache.solr.core.SolrCore.<init>(SolrCore.java:845)
    +... 31 more
     Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
    -    at org.apache.lucene.store.Lock.obtain(Lock.java:89)
    -    at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:753)
    -    at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:77)
    -    at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:64)
    -    at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:279)
    -    at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:111)
    -    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1528)
    -    ... 33 more
    -
    +at org.apache.lucene.store.Lock.obtain(Lock.java:89)
    +at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:753)
    +at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:77)
    +at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:64)
    +at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:279)
    +at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:111)
    +at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1528)
    +... 33 more
    +
  • -

    Solr stats working

    @@ -768,8 +759,8 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
  • Abenet was asking if the Atmire Usage Stats are correct because they are over 2 million the last few months…
  • For 2019-01 alone the Usage Stats are already around 1.2 million
  • I tried to look in the nginx logs to see how many raw requests there are so far this month and it’s about 1.4 million:

    # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
     1442874
    @@ -777,7 +768,8 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
     real    0m17.161s
     user    0m16.205s
     sys     0m2.396s
    -
    +
  • +

    2019-01-17

    @@ -834,30 +826,35 @@ sys 0m2.396s

    2019-01-20

    +
  • That’s weird, I logged into DSpace Test (linode19) and it says it has been up for 213 days:

    # w
    - 04:46:14 up 213 days,  7:25,  4 users,  load average: 1.94, 1.50, 1.35
    -
    +04:46:14 up 213 days, 7:25, 4 users, load average: 1.94, 1.50, 1.35
    +
  • -

    2019-01-21

    + +
  • I could either run with a simple tomcat7.service like this:

    [Unit]
     Description=Apache Tomcat 7 Web Application Container
    @@ -870,11 +867,9 @@ User=aorth
     Group=aorth
     [Install]
     WantedBy=multi-user.target
    -
    +
  • - +
  • Or try to adapt a real systemd service like Arch Linux’s:

    [Unit]
     Description=Tomcat 7 servlet container
    @@ -892,57 +887,57 @@ Environment=ERRFILE=SYSLOG
     Environment=OUTFILE=SYSLOG
     
     ExecStart=/usr/bin/jsvc \
    -            -Dcatalina.home=${CATALINA_HOME} \
    -            -Dcatalina.base=${CATALINA_BASE} \
    -            -Djava.io.tmpdir=/var/tmp/tomcat7/temp \
    -            -cp /usr/share/java/commons-daemon.jar:/usr/share/java/eclipse-ecj.jar:${CATALINA_HOME}/bin/bootstrap.jar:${CATALINA_HOME}/bin/tomcat-juli.jar \
    -            -user tomcat7 \
    -            -java-home ${TOMCAT_JAVA_HOME} \
    -            -pidfile /var/run/tomcat7.pid \
    -            -errfile ${ERRFILE} \
    -            -outfile ${OUTFILE} \
    -            $CATALINA_OPTS \
    -            org.apache.catalina.startup.Bootstrap
    +        -Dcatalina.home=${CATALINA_HOME} \
    +        -Dcatalina.base=${CATALINA_BASE} \
    +        -Djava.io.tmpdir=/var/tmp/tomcat7/temp \
    +        -cp /usr/share/java/commons-daemon.jar:/usr/share/java/eclipse-ecj.jar:${CATALINA_HOME}/bin/bootstrap.jar:${CATALINA_HOME}/bin/tomcat-juli.jar \
    +        -user tomcat7 \
    +        -java-home ${TOMCAT_JAVA_HOME} \
    +        -pidfile /var/run/tomcat7.pid \
    +        -errfile ${ERRFILE} \
    +        -outfile ${OUTFILE} \
    +        $CATALINA_OPTS \
    +        org.apache.catalina.startup.Bootstrap
     
     ExecStop=/usr/bin/jsvc \
    -            -pidfile /var/run/tomcat7.pid \
    -            -stop \
    -            org.apache.catalina.startup.Bootstrap
    +        -pidfile /var/run/tomcat7.pid \
    +        -stop \
    +        org.apache.catalina.startup.Bootstrap
     
     [Install]
     WantedBy=multi-user.target
    -
    +
  • - +
  • I see that jsvc and libcommons-daemon-java are both available on Ubuntu so that should be easy to port

  • + +
  • We probably don’t need Eclipse Java Bytecode Compiler (ecj)

  • + +
  • I tested Tomcat 7.0.92 on Arch Linux using the tomcat7.service with jsvc and it works… nice!

  • + +
  • I think I might manage this the same way I do the restic releases in the Ansible infrastructure scripts, where I download a specific version and symlink to some generic location without the version number

  • + +
  • I verified that there is indeed an issue with sharded Solr statistics cores on DSpace, which will cause inaccurate results in the dspace-statistics-api:

    $ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:2+id:11576&fq=isBot:false&fq=statistics_type:view' | grep numFound
     <result name="response" numFound="33" start="0">
     $ http 'http://localhost:3000/solr/statistics-2018/select?indent=on&rows=0&q=type:2+id:11576&fq=isBot:false&fq=statistics_type:view' | grep numFound
     <result name="response" numFound="241" start="0">
    -
    +
  • - +
  • I opened an issue on the GitHub issue tracker (#10)

  • + +
  • I don’t think the SolrClient library we are currently using supports these types of queries, so we might have to just do raw queries with requests

  • + +
  • The pysolr library says it supports multicore indexes, but I am not sure it does (or at least not with our setup):

    import pysolr
     solr = pysolr.Solr('http://localhost:3000/solr/statistics')
     results = solr.search('type:2', **{'fq': 'isBot:false AND statistics_type:view', 'facet': 'true', 'facet.field': 'id', 'facet.mincount': 1, 'facet.limit': 10, 'facet.offset': 0, 'rows': 0})
     print(results.facets['facet_fields'])
     {'id': ['77572', 646, '93185', 380, '92932', 375, '102499', 372, '101430', 337, '77632', 331, '102449', 289, '102485', 276, '100849', 270, '47080', 260]}
    -
    +
  • - +
  • If I double check one item from above, for example 77572, it appears this is only working on the current statistics core and not the shards:

    import pysolr
     solr = pysolr.Solr('http://localhost:3000/solr/statistics')
    @@ -953,20 +948,18 @@ solr = pysolr.Solr('http://localhost:3000/solr/statistics-2018/')
     results = solr.search('type:2 id:77572', **{'fq': 'isBot:false AND statistics_type:view'})
     print(results.hits)
     595
    -
    +
  • - +
  • So I guess I need to figure out how to use join queries and maybe even switch to using raw Python requests with JSON
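
  • A raw query with requests might look something like this (a sketch, assuming the same localhost Solr ports used above; the wt=json parameter just makes the response easy to parse):

    import requests

    # Query the current statistics core and tell Solr to also consult the 2018 shard,
    # then read numFound from the JSON response
    params = {
        "q": "type:2 id:11576",
        "fq": ["isBot:false", "statistics_type:view"],
        "shards": "localhost:8081/solr/statistics-2018",
        "rows": 0,
        "wt": "json",
    }
    response = requests.get("http://localhost:3000/solr/statistics/select", params=params)
    print(response.json()["response"]["numFound"])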

  • + +
  • This enumerates the list of Solr cores and returns JSON format:

    http://localhost:3000/solr/admin/cores?action=STATUS&wt=json
    -
    +
  • - +
  • I think I figured out how to search across shards: I needed to give the whole URL of each other core in the shards parameter (a sketch for building that list automatically follows)
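
  • A rough Python sketch for that, using the core STATUS endpoint above (assumes requests and that all the yearly cores follow the statistics-* naming):

    import requests

    # Ask Solr for the list of cores and build a shards parameter that includes
    # every yearly statistics-* core alongside the current statistics core
    status = requests.get(
        "http://localhost:3000/solr/admin/cores", params={"action": "STATUS", "wt": "json"}
    ).json()

    shards = ",".join(
        f"localhost:8081/solr/{name}"
        for name in status["status"]
        if name.startswith("statistics-")
    )
    print(shards)
    # e.g. localhost:8081/solr/statistics-2018,localhost:8081/solr/statistics-2017,...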

  • + +
  • Now I get more results when I start adding the other statistics cores:

    $ http 'http://localhost:3000/solr/statistics/select?&indent=on&rows=0&q=*:*' | grep numFound
    <result name="response" numFound="2061320" start="0">
     $ http 'http://localhost:3000/solr/statistics/select?&shards=localhost:8081/solr/statistics-2018&indent=on&rows=0&q=*:*' | grep numFound
    @@ -975,26 +968,28 @@ $ http 'http://localhost:3000/solr/statistics/select?&shards=localhost:8081/
     <result name="response" numFound="25606142" start="0" maxScore="1.0">
     $ http 'http://localhost:3000/solr/statistics/select?&shards=localhost:8081/solr/statistics-2018,localhost:8081/solr/statistics-2017,localhost:8081/solr/statistics-2016&indent=on&rows=0&q=*:*' | grep numFound
     <result name="response" numFound="31532212" start="0" maxScore="1.0">
    -
    +
  • - + +
  • For example, compare the following two queries, first including the base core and the shard in the shards parameter, and then only including the shard:

    $ http 'http://localhost:8081/solr/statistics/select?indent=on&rows=0&q=type:2+id:11576&fq=isBot:false&fq=statistics_type:view&shards=localhost:8081/solr/statistics,localhost:8081/solr/statistics-2018' | grep numFound
     <result name="response" numFound="275" start="0" maxScore="12.205825">
     $ http 'http://localhost:8081/solr/statistics/select?indent=on&rows=0&q=type:2+id:11576&fq=isBot:false&fq=statistics_type:view&shards=localhost:8081/solr/statistics-2018' | grep numFound
     <result name="response" numFound="241" start="0" maxScore="12.205825">
    -
    +
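  • And a hedged sketch of what the "raw queries with requests" approach could look like in Python, reusing the query and shards parameters from the examples above (this is not the actual dspace-statistics-api code):

    import requests

    params = {
        'q': 'type:2 id:11576',
        'fq': ['isBot:false', 'statistics_type:view'],  # repeated fq parameters, as in the http examples
        'shards': 'localhost:8081/solr/statistics-2018',
        'rows': 0,
        'wt': 'json',
    }
    res = requests.get('http://localhost:8081/solr/statistics/select', params=params)
    print(res.json()['response']['numFound'])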
  • + +

    2019-01-22

    @@ -1002,37 +997,36 @@ $ http 'http://localhost:8081/solr/statistics/select?indent=on&rows=0&q=
  • Release version 0.9.0 of the dspace-statistics-api to address the issue of querying multiple Solr statistics shards
  • I deployed it on DSpace Test (linode19) and restarted the indexer, and now it shows all the stats from 2018 as well (756 pages of views, instead of 6)
  • I deployed it on CGSpace (linode18) and restarted the indexer as well
  • Linode sent an alert that CGSpace (linode18) was using high CPU this afternoon, the top ten IPs during that time were:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "22/Jan/2019:1(4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -    155 40.77.167.106
    -    176 2003:d5:fbda:1c00:1106:c7a0:4b17:3af8
    -    189 107.21.16.70
    -    217 54.83.93.85
    -    310 46.174.208.142
    -    346 83.103.94.48
    -    360 45.5.186.2
    -    595 154.113.73.30
    -    716 196.191.127.37
    -    915 35.237.175.180
    -
+155 40.77.167.106
+176 2003:d5:fbda:1c00:1106:c7a0:4b17:3af8
+189 107.21.16.70
+217 54.83.93.85
+310 46.174.208.142
+346 83.103.94.48
+360 45.5.186.2
+595 154.113.73.30
+716 196.191.127.37
+915 35.237.175.180
+
  • - +
  • 35.237.175.180 is known to us

  • + +
  • I don’t think we’ve seen 196.191.127.37 before. Its user agent is:

    Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/7.0.185.1002 Safari/537.36
    -
    +
  • - +
  • Interestingly this IP is located in Addis Ababa…

  • + +
  • Another interesting one is 154.113.73.30, which is apparently at IITA Nigeria and uses the user agent:

    Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36
    -
    +
  • +

    2019-01-23

    @@ -1065,32 +1059,29 @@ $ http 'http://localhost:8081/solr/statistics/select?indent=on&rows=0&q=
  • Maria Garruccio asked me for a list of author affiliations from all of their submitted items so she can clean them up

  • I got a list of their collections from the CGSpace XMLUI and then used an SQL query to dump the unique values to CSV:

    dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/35501', '10568/41728', '10568/49622', '10568/56589', '10568/56592', '10568/65064', '10568/65718', '10568/65719', '10568/67373', '10568/67731', '10568/68235', '10568/68546', '10568/69089', '10568/69160', '10568/69419', '10568/69556', '10568/70131', '10568/70252', '10568/70978'))) group by text_value order by count desc) to /tmp/bioversity-affiliations.csv with csv;
     COPY 1109
    -
    +
  • - +
  • Send a mail to the dspace-tech mailing list about the OpenSearch issue we had with the Livestock CRP

  • + +
  • Linode sent an alert that CGSpace (linode18) had a high load this morning, here are the top ten IPs during that time:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "23/Jan/2019:0(4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -    222 54.226.25.74
    -    241 40.77.167.13
    -    272 46.101.86.248
    -    297 35.237.175.180
    -    332 45.5.184.72
    -    355 34.218.226.147
    -    404 66.249.64.155
    -   4637 205.186.128.185
    -   4637 70.32.83.92
    -   9265 45.5.186.2
    -
+222 54.226.25.74
+241 40.77.167.13
+272 46.101.86.248
+297 35.237.175.180
+332 45.5.184.72
+355 34.218.226.147
+404 66.249.64.155
+4637 205.186.128.185
+4637 70.32.83.92
+9265 45.5.186.2
+
  • - +
  • Just to make sure these were not uploaded by the user or something, I manually forced the regeneration of these with DSpace’s filter-media:

    $ schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace filter-media -v -f -i 10568/98390
     $ schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace filter-media -v -f -i 10568/98391
    -
    +
  • -

    2019-01-24

    @@ -1128,71 +1120,72 @@ $ schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace fi
  • I noticed Ubuntu’s Ghostscript 9.26 works on some troublesome PDFs where Arch’s Ghostscript 9.26 doesn’t, so the fix for the first/last page crash is not the patch I found yesterday
  • Ubuntu’s Ghostscript uses another patch from Ghostscript git (upstream bug report)
  • I re-compiled Arch’s ghostscript with the patch and then I was able to generate a thumbnail from one of the troublesome PDFs
  • Before and after:

    $ identify Food\ safety\ Kenya\ fruits.pdf\[0\]
     zsh: abort (core dumped)  identify Food\ safety\ Kenya\ fruits.pdf\[0\]
     $ identify Food\ safety\ Kenya\ fruits.pdf\[0\]
     Food safety Kenya fruits.pdf[0]=>Food safety Kenya fruits.pdf PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.000u 0:00.000
     identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/1747.
    -
    +
  • - +
  • I reported it to the Arch Linux bug tracker (61513)

  • + +
  • I told Atmire to go ahead with the Metadata Quality Module addition based on our 5_x-dev branch (657)

  • + +
  • Linode sent alerts last night to say that CGSpace (linode18) was using high CPU last night, here are the top ten IPs from the nginx logs around that time:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "23/Jan/2019:(18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -    305 3.81.136.184
    -    306 3.83.14.11
    -    306 52.54.252.47
    -    325 54.221.57.180
    -    378 66.249.64.157
    -    424 54.70.40.11
    -    497 47.29.247.74
    -    783 35.237.175.180
    -   1108 66.249.64.155
    -   2378 45.5.186.2
    -
+305 3.81.136.184
+306 3.83.14.11
+306 52.54.252.47
+325 54.221.57.180
+378 66.249.64.157
+424 54.70.40.11
+497 47.29.247.74
+783 35.237.175.180
+1108 66.249.64.155
+2378 45.5.186.2
+
  • - +
  • 45.5.186.2 is CIAT and 66.249.64.155 is Google… hmmm.

  • + +
  • Linode sent another alert this morning, here are the top ten IPs active during that time:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "24/Jan/2019:0(4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -    360 3.89.134.93
    -    362 34.230.15.139
    -    366 100.24.48.177
    -    369 18.212.208.240
    -    377 3.81.136.184
    -    404 54.221.57.180
    -    506 66.249.64.155
    -   4642 70.32.83.92
    -   4643 205.186.128.185
    -   8593 45.5.186.2
    -
+360 3.89.134.93
+362 34.230.15.139
+366 100.24.48.177
+369 18.212.208.240
+377 3.81.136.184
+404 54.221.57.180
+506 66.249.64.155
+4642 70.32.83.92
+4643 205.186.128.185
+8593 45.5.186.2
+
  • - +
  • Just double checking what CIAT is doing, they are mainly hitting the REST API:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "24/Jan/2019:" | grep 45.5.186.2 | grep -Eo "GET /(handle|bitstream|rest|oai)/" | sort | uniq -c | sort -n
    -
    +
  • -

    2019-01-25

    @@ -1212,24 +1205,22 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/

    2019-01-27

    +
  • Linode sent an email that the server was using a lot of CPU this morning, and these were the top IPs in the web server logs at the time:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "27/Jan/2019:0(6|7|8)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -    189 40.77.167.108
    -    191 157.55.39.2
    -    263 34.218.226.147
    -    283 45.5.184.2
    -    332 45.5.184.72
    -    608 5.9.6.51
    -    679 66.249.66.223
    -   1116 66.249.66.219
    -   4644 205.186.128.185
    -   4644 70.32.83.92
    -
+189 40.77.167.108
+191 157.55.39.2
+263 34.218.226.147
+283 45.5.184.2
+332 45.5.184.72
+608 5.9.6.51
+679 66.249.66.223
+1116 66.249.66.219
+4644 205.186.128.185
+4644 70.32.83.92
+
  • - + +
  • Linode alerted that CGSpace (linode18) was using too much CPU again this morning, here are the active IPs from the web server log at the time:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "28/Jan/2019:0(6|7|8)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -     67 207.46.13.50
    -    105 41.204.190.40
    -    117 34.218.226.147
    -    126 35.237.175.180
    -    203 213.55.99.121
    -    332 45.5.184.72
    -    377 5.9.6.51
    -    512 45.5.184.2
    -   4644 205.186.128.185
    -   4644 70.32.83.92
    -
+ 67 207.46.13.50
+105 41.204.190.40
+117 34.218.226.147
+126 35.237.175.180
+203 213.55.99.121
+332 45.5.184.72
+377 5.9.6.51
+512 45.5.184.2
+4644 205.186.128.185
+4644 70.32.83.92
+
  • - + +
  • Last night Linode sent an alert that CGSpace (linode18) was using high CPU, here are the most active IPs in the hours just before, during, and after the alert:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "28/Jan/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -    310 45.5.184.2
    -    425 5.143.231.39
    -    526 54.70.40.11
    -   1003 199.47.87.141
    -   1374 35.237.175.180
    -   1455 5.9.6.51
    -   1501 66.249.66.223
    -   1771 66.249.66.219
    -   2107 199.47.87.140
    -   2540 45.5.186.2
    -
+310 45.5.184.2
+425 5.143.231.39
+526 54.70.40.11
+1003 199.47.87.141
+1374 35.237.175.180
+1455 5.9.6.51
+1501 66.249.66.223
+1771 66.249.66.219
+2107 199.47.87.140
+2540 45.5.186.2
+
  • - +
  • Of course there is CIAT’s 45.5.186.2, but also 45.5.184.2 appears to be CIAT… I wonder why they have two harvesters?

  • + +
  • 199.47.87.140 and 199.47.87.141 is TurnItIn with the following user agent:

    TurnitinBot (https://turnitin.com/robot/crawlerinfo.html)
    -
    +
  • +

    2019-01-29

    +
  • Linode sent an alert about CGSpace (linode18) CPU usage this morning, here are the top IPs in the web server logs just before, during, and after the alert:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "29/Jan/2019:0(3|4|5|6|7)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -    334 45.5.184.72
    -    429 66.249.66.223
    -    522 35.237.175.180
    -    555 34.218.226.147
    -    655 66.249.66.221
    -    844 5.9.6.51
    -   2507 66.249.66.219
    -   4645 70.32.83.92
    -   4646 205.186.128.185
    -   9329 45.5.186.2
    -
+334 45.5.184.72
+429 66.249.66.223
+522 35.237.175.180
+555 34.218.226.147
+655 66.249.66.221
+844 5.9.6.51
+2507 66.249.66.219
+4645 70.32.83.92
+4646 205.186.128.185
+9329 45.5.186.2
+
  • - + +
  • I tried the test-email command on DSpace and it indeed is not working:

    $ dspace test-email
     
     About to send test email:
    - - To: aorth@mjanja.ch
    - - Subject: DSpace test email
    - - Server: smtp.serv.cgnet.com
    +- To: aorth@mjanja.ch
    +- Subject: DSpace test email
    +- Server: smtp.serv.cgnet.com
     
     Error sending email:
    - - Error: javax.mail.MessagingException: Could not connect to SMTP host: smtp.serv.cgnet.com, port: 25;
    -  nested exception is:
    -        java.net.ConnectException: Connection refused (Connection refused)
    +- Error: javax.mail.MessagingException: Could not connect to SMTP host: smtp.serv.cgnet.com, port: 25;
    +nested exception is:
    +    java.net.ConnectException: Connection refused (Connection refused)
     
     Please see the DSpace documentation for assistance.
    -
    +
  • -

    2019-02-08

    +
  • I re-configured CGSpace to use the email/password for cgspace-support, but I get this error when I try the test-email script:

    Error sending email:
    - - Error: com.sun.mail.smtp.SMTPSendFailedException: 530 5.7.57 SMTP; Client was not authenticated to send anonymous mail during MAIL FROM [AM6PR10CA0028.EURPRD10.PROD.OUTLOOK.COM]
    -
+- Error: com.sun.mail.smtp.SMTPSendFailedException: 530 5.7.57 SMTP; Client was not authenticated to send anonymous mail during MAIL FROM [AM6PR10CA0028.EURPRD10.PROD.OUTLOOK.COM]
+
  • -

    2019-02-09

    + +
  • This is just for this morning:

    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "09/Feb/2019:(07|08|09|10|11)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -    289 35.237.175.180
    -    290 66.249.66.221
    -    296 18.195.78.144
    -    312 207.46.13.201
    -    393 207.46.13.64
    -    526 2a01:4f8:140:3192::2
    -    580 151.80.203.180
    -    742 5.143.231.38
    -   1046 5.9.6.51
    -   1331 66.249.66.219
    +289 35.237.175.180
    +290 66.249.66.221
    +296 18.195.78.144
    +312 207.46.13.201
    +393 207.46.13.64
    +526 2a01:4f8:140:3192::2
    +580 151.80.203.180
    +742 5.143.231.38
    +1046 5.9.6.51
    +1331 66.249.66.219
     # zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "09/Feb/2019:(07|08|09|10|11)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -      4 66.249.83.30
    -      5 49.149.10.16
    -      8 207.46.13.64
    -      9 207.46.13.201
    -     11 105.63.86.154
    -     11 66.249.66.221
    -     31 66.249.66.219
    -    297 2001:41d0:d:1990::
    -    908 34.218.226.147
    -   1947 50.116.102.77
    -
+ 4 66.249.83.30
+ 5 49.149.10.16
+ 8 207.46.13.64
+ 9 207.46.13.201
+ 11 105.63.86.154
+ 11 66.249.66.221
+ 31 66.249.66.219
+297 2001:41d0:d:1990::
+908 34.218.226.147
+1947 50.116.102.77
+
  • - +
  • I know 66.249.66.219 is Google, 5.9.6.51 is MegaIndex, and 5.143.231.38 is SputnikBot

  • + +
  • Ooh, but 151.80.203.180 is some malicious bot making requests for /etc/passwd like this:

    /bitstream/handle/10568/68981/Identifying%20benefit%20flows%20studies%20on%20the%20potential%20monetary%20and%20non%20monetary%20benefits%20arising%20from%20the%20International%20Treaty%20on%20Plant%20Genetic_1671.pdf?sequence=1&amp;isAllowed=../etc/passwd
    -
    +
  • -

    2019-02-10

    +
  • Linode sent another alert about CGSpace (linode18) CPU load this morning, here are the top IPs in the web server XMLUI and API logs before, during, and after that time:

    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "10/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -    232 18.195.78.144
    -    238 35.237.175.180
    -    281 66.249.66.221
    -    314 151.80.203.180
    -    319 34.218.226.147
    -    326 40.77.167.178
    -    352 157.55.39.149
    -    444 2a01:4f8:140:3192::2
    -   1171 5.9.6.51
    -   1196 66.249.66.219
    +232 18.195.78.144
    +238 35.237.175.180
    +281 66.249.66.221
    +314 151.80.203.180
    +319 34.218.226.147
    +326 40.77.167.178
    +352 157.55.39.149
    +444 2a01:4f8:140:3192::2
    +1171 5.9.6.51
    +1196 66.249.66.219
     # zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "10/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -      6 112.203.241.69
    -      7 157.55.39.149
    -      9 40.77.167.178
    -     15 66.249.66.219
    -    368 45.5.184.72
    -    432 50.116.102.77
    -    971 34.218.226.147
    -   4403 45.5.186.2
    -   4668 205.186.128.185
    -   4668 70.32.83.92
    -
+ 6 112.203.241.69
+ 7 157.55.39.149
+ 9 40.77.167.178
+ 15 66.249.66.219
+368 45.5.184.72
+432 50.116.102.77
+971 34.218.226.147
+4403 45.5.186.2
+4668 205.186.128.185
+4668 70.32.83.92
+
  • - +
  • Another interesting thing might be the total number of requests for web and API services during that time:

    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -cE "10/Feb/2019:0(5|6|7|8|9)"
     16333
     # zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -cE "10/Feb/2019:0(5|6|7|8|9)"
     15964
    -
    +
  • - +
  • Also, the number of unique IPs served during that time:

    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "10/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq | wc -l
     1622
     # zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "10/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq | wc -l
     95
    -
    +
  • - +
  • Setting it to true results in the following message when I try the dspace test-email helper on DSpace Test:

    Error sending email:
    - - Error: cannot test email because mail.server.disabled is set to true
    -
+- Error: cannot test email because mail.server.disabled is set to true
+
  • + - + +
  • I updated my local Sonatype nexus Docker image and had an issue with the volume for some reason so I decided to just start from scratch:

    # docker rm nexus
     # docker pull sonatype/nexus3
     # mkdir -p /home/aorth/.local/lib/containers/volumes/nexus_data
     # chown 200:200 /home/aorth/.local/lib/containers/volumes/nexus_data
     # docker run --name nexus --network dspace-build -d -v /home/aorth/.local/lib/containers/volumes/nexus_data:/nexus-data -p 8081:8081 sonatype/nexus3
    -
    +
  • - +
  • For some reason my mvn package for DSpace is not working now… I might go back to using Artifactory for caching instead:

    # docker pull docker.bintray.io/jfrog/artifactory-oss:latest
     # mkdir -p /home/aorth/.local/lib/containers/volumes/artifactory5_data
     # chown 1030 /home/aorth/.local/lib/containers/volumes/artifactory5_data
     # docker run --name artifactory --network dspace-build -d -v /home/aorth/.local/lib/containers/volumes/artifactory5_data:/var/opt/jfrog/artifactory -p 8081:8081 docker.bintray.io/jfrog/artifactory-oss
    -
    +
  • +

    2019-02-11

    @@ -762,101 +734,94 @@ Please see the DSpace documentation for assistance. + +
  • Testing the vipsthumbnail command line tool with this CGSpace item that uses CMYK:

    $ vipsthumbnail alc_contrastes_desafios.pdf -s 300 -o '%s.jpg[Q=92,optimize_coding,strip]'
    -
    +
  • - +
  • (DSpace 5 appears to use JPEG 92 quality so I do the same)

  • + +
  • Thinking about making “top items” endpoints in my dspace-statistics-api

  • + +
  • I could use the following SQL queries very easily to get the top items by views or downloads:

    dspacestatistics=# SELECT * FROM items WHERE views > 0 ORDER BY views DESC LIMIT 10;
     dspacestatistics=# SELECT * FROM items WHERE downloads > 0 ORDER BY downloads DESC LIMIT 10;
    -
    +
  • + +
  • I’d have to think about what to make the REST API endpoints, perhaps: /statistics/top/items?limit=10

  • + +
  • But how do I do top items by views / downloads separately?
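  • One possible approach, as a sketch only: whitelist a sort parameter and key the ORDER BY column off it (the psycopg2 connection details here are placeholders):

    import psycopg2

    def top_items(sort='views', limit=10):
        # only allow known columns so user input is never interpolated blindly
        if sort not in ('views', 'downloads'):
            raise ValueError('sort must be "views" or "downloads"')
        conn = psycopg2.connect(dbname='dspacestatistics', user='dspacestatistics', password='fuuu')
        with conn.cursor() as cursor:
            cursor.execute(
                f'SELECT * FROM items WHERE {sort} > 0 ORDER BY {sort} DESC LIMIT %s',
                (limit,)
            )
            return cursor.fetchall()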

  • + +
  • I re-deployed DSpace 6.3 locally to test the PDFBox thumbnails, especially to see if they handle CMYK files properly

    +
  • The quality is JPEG 75 and I don’t see a way to set the thumbnail dimensions, but the resulting image is indeed sRGB:

    $ identify -verbose alc_contrastes_desafios.pdf.jpg
     ...
    -  Colorspace: sRGB
    -
+Colorspace: sRGB
+
  • + -

    2019-02-13

    + +
  • I even added extra mail properties to dspace.cfg as suggested by someone on the dspace-tech mailing list:

    mail.extraproperties = mail.smtp.starttls.required = true, mail.smtp.auth=true
    -
    +
  • - +
  • But the result is still:

    Error sending email:
    - - Error: com.sun.mail.smtp.SMTPSendFailedException: 530 5.7.57 SMTP; Client was not authenticated to send anonymous mail during MAIL FROM [AM6PR06CA0001.eurprd06.prod.outlook.com]
    -
+- Error: com.sun.mail.smtp.SMTPSendFailedException: 530 5.7.57 SMTP; Client was not authenticated to send anonymous mail during MAIL FROM [AM6PR06CA0001.eurprd06.prod.outlook.com]
+
  • - +
  • I tried to log into the Outlook 365 web mail and it doesn’t work so I’ve emailed ILRI ICT again

  • + +
  • After reading the common mistakes in the JavaMail FAQ I reconfigured the extra properties in DSpace’s mail configuration to be simply:

    mail.extraproperties = mail.smtp.starttls.enable=true
    -
    +
  • - +
  • … and then I was able to send a mail using my personal account where I know the credentials work

  • + +
  • The CGSpace account still gets this error message:

    Error sending email:
    - - Error: javax.mail.AuthenticationFailedException
    -
+- Error: javax.mail.AuthenticationFailedException
+
  • - +
  • I updated the DSpace SMTP settings in dspace.cfg as well as the variables in the DSpace role of the Ansible infrastructure scripts

  • + +
  • Thierry from CTA is having issues with his account on DSpace Test, and there is no admin password reset function on DSpace (only via email, which is disabled on DSpace Test), so I have to delete and re-create his account:

    $ dspace user --delete --email blah@cta.int
     $ dspace user --add --givenname Thierry --surname Lewyllie --email blah@cta.int --password 'blah'
    -
    +
  • - +
  • On this note, I saw a thread on the dspace-tech mailing list that says this functionality exists if you enable webui.user.assumelogin = true

  • + +
  • I will enable this on CGSpace (#411)

  • + +
  • Test re-creating my local PostgreSQL and Artifactory containers with podman instead of Docker (using the volumes from my old Docker containers though):

    # podman pull postgres:9.6-alpine
     # podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
     # podman pull docker.bintray.io/jfrog/artifactory-oss
     # podman run --name artifactory -d -v /home/aorth/.local/lib/containers/volumes/artifactory5_data:/var/opt/jfrog/artifactory -p 8081:8081 docker.bintray.io/jfrog/artifactory-oss
    -
    +
  • - +
  • Totally works… awesome!

  • + +
  • Then I tried with rootless containers by creating the subuid and subgid mappings for aorth:

    $ sudo touch /etc/subuid /etc/subgid
     $ usermod --add-subuids 10000-75535 aorth
    @@ -864,12 +829,11 @@ $ usermod --add-subgids 10000-75535 aorth
     $ sudo sysctl kernel.unprivileged_userns_clone=1
     $ podman pull postgres:9.6-alpine
     $ podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
    -
    +
  • - +
  • Which totally works, but Podman’s rootless support doesn’t work with port mappings yet…

  • + +
  • Deploy the Tomcat-7-from-tarball branch on CGSpace (linode18), but first stop the Ubuntu Tomcat 7 and do some basic prep before running the Ansible playbook:

    # systemctl stop tomcat7
     # apt remove tomcat7 tomcat7-admin
    @@ -879,93 +843,92 @@ $ podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspace
     # chown -R dspace:dspace /home/dspace
     # chown -R dspace:dspace /home/cgspace.cgiar.org
     # dpkg -P tomcat7-admin tomcat7-common
    -
    +
  • - +
  • After running the playbook CGSpace came back up, but I had an issue with some Solr cores not being loaded (similar to last month) and this was in the Solr log:

    2019-02-14 18:17:31,304 ERROR org.apache.solr.core.SolrCore @ org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2018': Unable to create core [statistics-2018] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
    -
    +
  • - +
  • The issue last month was address space, which is now set as LimitAS=infinity in tomcat7.service

  • + +
  • I re-ran the Ansible playbook to make sure all configs etc were in place, then rebooted the server

  • + +
  • Still the error persists after reboot

  • + +
  • I will try to stop Tomcat and then remove the locks manually:

    # find /home/cgspace.cgiar.org/solr/ -iname "write.lock" -delete
    -
    +
  • -

    2019-02-15

    +
  • Tomcat was killed around 3AM by the kernel’s OOM killer according to dmesg:

    [Fri Feb 15 03:10:42 2019] Out of memory: Kill process 12027 (java) score 670 or sacrifice child
     [Fri Feb 15 03:10:42 2019] Killed process 12027 (java) total-vm:14108048kB, anon-rss:5450284kB, file-rss:0kB, shmem-rss:0kB
     [Fri Feb 15 03:10:43 2019] oom_reaper: reaped process 12027 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    -
    +
  • - +
  • The tomcat7 service shows:

    Feb 15 03:10:44 linode19 systemd[1]: tomcat7.service: Main process exited, code=killed, status=9/KILL
    -
    +
  • - +
  • I suspect it was related to the media-filter cron job that runs at 3AM but I don’t see anything particular in the log files

  • + +
  • I want to try to normalize the text_lang values to make working with metadata easier

  • + +
  • We currently have a bunch of weird values that DSpace uses like NULL, en_US, and en and others that have been entered manually by editors:

    dspace=# SELECT DISTINCT text_lang, count(*) FROM metadatavalue WHERE resource_type_id=2 GROUP BY text_lang ORDER BY count DESC;
    - text_lang |  count
    +text_lang |  count
     -----------+---------
    -           | 1069539
    - en_US     |  577110
    -           |  334768
    - en        |  133501
    - es        |      12
    - *         |      11
    - es_ES     |       2
    - fr        |       2
    - spa       |       2
    - E.        |       1
    - ethnob    |       1
    -
+ | 1069539
+en_US | 577110
+ | 334768
+en | 133501
+es | 12
+* | 11
+es_ES | 2
+fr | 2
+spa | 2
+E. | 1
+ethnob | 1
+
  • - +
  • The majority are NULL, en_US, the blank string, and en—the rest are not enough to be significant

  • + +
  • Theoretically this field could help if you wanted to search for Spanish-language fields in the API or something, but even for the English fields there are two different values (and those are from DSpace itself)!

  • + +
  • I’m going to normalize these to NULL, at least on DSpace Test for now:

    dspace=# UPDATE metadatavalue SET text_lang = NULL WHERE resource_type_id=2 AND text_lang IS NOT NULL;
     UPDATE 1045410
    -
    +
  • - + +
  • ILRI ICT fixed the password for the CGSpace support email account and I tested it on Outlook 365 web and DSpace and it works

  • + +
  • Re-create my local PostgreSQL container to use the new PostgreSQL version and podman’s volumes:

    $ podman pull postgres:9.6-alpine
     $ podman volume create dspacedb_data
    @@ -976,12 +939,11 @@ $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
     $ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost dspace_2019-02-11.backup
     $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
     $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
    -
    +
  • - +
  • And it’s all running without root!

  • + +
  • Then re-create my Artifactory container as well, taking into account ulimit open file requirements by Artifactory as well as the user limitations caused by rootless subuid mappings:

    $ podman volume create artifactory_data
     artifactory_data
    @@ -990,24 +952,21 @@ $ buildah unshare
     $ chown -R 1030:1030 ~/.local/share/containers/storage/volumes/artifactory_data
     $ exit
     $ podman start artifactory
    -
    +
  • -

    2019-02-17

    +
  • I ran DSpace’s cleanup task on CGSpace (linode18) and there were errors:

    $ dspace cleanup -v
     Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
    -  Detail: Key (bitstream_id)=(162844) is still referenced from table "bundle".
    -
+Detail: Key (bitstream_id)=(162844) is still referenced from table "bundle".
+
  • Dump top 1500 subjects from CGSpace to try one more time to generate a list of invalid terms using my agrovoc-lookup.py script:

    dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 57 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-03-18-top-1500-subject.csv WITH CSV HEADER;
     COPY 1500
    @@ -534,174 +521,161 @@ $ sort -u 2019-03-18-top-1500-subject.csv > /tmp/1500-subjects-sorted.txt
     $ comm -13 /tmp/subjects-matched-sorted.txt /tmp/1500-subjects-sorted.txt > 2019-03-18-subjects-unmatched.txt
     $ wc -l 2019-03-18-subjects-unmatched.txt
     182 2019-03-18-subjects-unmatched.txt
    -
    +
  • - +
  • So the new total of matched terms with the updated regex is 1317 and unmatched is 183 (previous number of matched terms was 1187)

  • + +
  • Create and merge a pull request to update the controlled vocabulary for AGROVOC terms (#416)

  • + +
  • We are getting the blank page issue on CGSpace again today and I see a large number of the “SQL QueryTable Error” in the DSpace log again (last time was 2019-03-15):

    $ grep -c 'SQL QueryTable Error' dspace.log.2019-03-1[5678]
     dspace.log.2019-03-15:929
     dspace.log.2019-03-16:67
     dspace.log.2019-03-17:72
     dspace.log.2019-03-18:1038
    -
    +
  • - +
  • Though WTF, this grep seems to be giving weird inaccurate results actually, and the real number of errors is much lower if I exclude the “binary file matches” result with -I:

    $ grep -I 'SQL QueryTable Error' dspace.log.2019-03-18 | wc -l
     8
     $ grep -I 'SQL QueryTable Error' dspace.log.2019-03-{08,14,15,16,17,18} | awk -F: '{print $1}' | sort | uniq -c
    -      9 dspace.log.2019-03-08
    -     25 dspace.log.2019-03-14
    -     12 dspace.log.2019-03-15
    -     67 dspace.log.2019-03-16
    -     72 dspace.log.2019-03-17
    -      8 dspace.log.2019-03-18
    -
+ 9 dspace.log.2019-03-08
+ 25 dspace.log.2019-03-14
+ 12 dspace.log.2019-03-15
+ 67 dspace.log.2019-03-16
+ 72 dspace.log.2019-03-17
+ 8 dspace.log.2019-03-18
+
  • - +
  • It seems to be something with grep doing binary matching on some log files for some reason, so I guess I need to always use -I to say binary files don’t match

  • + +
  • Anyways, the full error in DSpace’s log is:

    2019-03-18 12:26:23,331 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error - 
     java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@75eaa668 is closed.
    -        at org.apache.tomcat.dbcp.dbcp.DelegatingConnection.checkOpen(DelegatingConnection.java:398)
    -        at org.apache.tomcat.dbcp.dbcp.DelegatingConnection.prepareStatement(DelegatingConnection.java:279)
    -        at org.apache.tomcat.dbcp.dbcp.PoolingDataSource$PoolGuardConnectionWrapper.prepareStatement(PoolingDataSource.java:313)
    -        at org.dspace.storage.rdbms.DatabaseManager.queryTable(DatabaseManager.java:220)
    -
+ at org.apache.tomcat.dbcp.dbcp.DelegatingConnection.checkOpen(DelegatingConnection.java:398)
+ at org.apache.tomcat.dbcp.dbcp.DelegatingConnection.prepareStatement(DelegatingConnection.java:279)
+ at org.apache.tomcat.dbcp.dbcp.PoolingDataSource$PoolGuardConnectionWrapper.prepareStatement(PoolingDataSource.java:313)
+ at org.dspace.storage.rdbms.DatabaseManager.queryTable(DatabaseManager.java:220)
+
  • - +
  • There is a low number of connections to PostgreSQL currently:

    $ psql -c 'select * from pg_stat_activity' | wc -l
     33
     $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
    -      6 dspaceApi
    -      7 dspaceCli
    -     15 dspaceWeb
    -
+ 6 dspaceApi
+ 7 dspaceCli
+ 15 dspaceWeb
+
  • - +
  • I looked in the PostgreSQL logs, but all I see are a bunch of these errors going back two months to January:

    2019-01-13 06:25:13.062 CET [9157] postgres@template1 ERROR:  column "waiting" does not exist at character 217
    -
    +
  • - +
  • This is unrelated and apparently due to Munin checking a column that was changed in PostgreSQL 9.6

  • + +
  • I suspect that this issue with the blank pages might not be PostgreSQL after all, perhaps it’s a Cocoon thing?

  • + +
  • Looking in the cocoon logs I see a large number of warnings about “Can not load requested doc” around 11AM and 12PM:

    $ grep 'Can not load requested doc' cocoon.log.2019-03-18 | grep -oE '2019-03-18 [0-9]{2}:' | sort | uniq -c
    -      2 2019-03-18 00:
    -      6 2019-03-18 02:
    -      3 2019-03-18 04:
    -      1 2019-03-18 05:
    -      1 2019-03-18 07:
    -      2 2019-03-18 08:
    -      4 2019-03-18 09:
    -      5 2019-03-18 10:
    -    863 2019-03-18 11:
    -    203 2019-03-18 12:
    -     14 2019-03-18 13:
    -      1 2019-03-18 14:
    -
+ 2 2019-03-18 00:
+ 6 2019-03-18 02:
+ 3 2019-03-18 04:
+ 1 2019-03-18 05:
+ 1 2019-03-18 07:
+ 2 2019-03-18 08:
+ 4 2019-03-18 09:
+ 5 2019-03-18 10:
+863 2019-03-18 11:
+203 2019-03-18 12:
+ 14 2019-03-18 13:
+ 1 2019-03-18 14:
+
  • - +
  • And a few days ago, on 2019-03-15, when this last happened, it was in the afternoon, and the same pattern occurs around 1–2PM:

    $ xzgrep 'Can not load requested doc' cocoon.log.2019-03-15.xz | grep -oE '2019-03-15 [0-9]{2}:' | sort | uniq -c
    -      4 2019-03-15 01:
    -      3 2019-03-15 02:
    -      1 2019-03-15 03:
    -     13 2019-03-15 04:
    -      1 2019-03-15 05:
    -      2 2019-03-15 06:
    -      3 2019-03-15 07:
    -     27 2019-03-15 09:
    -      9 2019-03-15 10:
    -      3 2019-03-15 11:
    -      2 2019-03-15 12:
    -    531 2019-03-15 13:
    -    274 2019-03-15 14:
    -      4 2019-03-15 15:
    -     75 2019-03-15 16:
    -      5 2019-03-15 17:
    -      5 2019-03-15 18:
    -      6 2019-03-15 19:
    -      2 2019-03-15 20:
    -      4 2019-03-15 21:
    -      3 2019-03-15 22:
    -      1 2019-03-15 23:
    -
+ 4 2019-03-15 01:
+ 3 2019-03-15 02:
+ 1 2019-03-15 03:
+ 13 2019-03-15 04:
+ 1 2019-03-15 05:
+ 2 2019-03-15 06:
+ 3 2019-03-15 07:
+ 27 2019-03-15 09:
+ 9 2019-03-15 10:
+ 3 2019-03-15 11:
+ 2 2019-03-15 12:
+531 2019-03-15 13:
+274 2019-03-15 14:
+ 4 2019-03-15 15:
+ 75 2019-03-15 16:
+ 5 2019-03-15 17:
+ 5 2019-03-15 18:
+ 6 2019-03-15 19:
+ 2 2019-03-15 20:
+ 4 2019-03-15 21:
+ 3 2019-03-15 22:
+ 1 2019-03-15 23:
+
  • - +
  • And again on 2019-03-08, surprise surprise, it happened in the morning:

    $ xzgrep 'Can not load requested doc' cocoon.log.2019-03-08.xz | grep -oE '2019-03-08 [0-9]{2}:' | sort | uniq -c
    -     11 2019-03-08 01:
    -      3 2019-03-08 02:
    -      1 2019-03-08 03:
    -      2 2019-03-08 04:
    -      1 2019-03-08 05:
    -      1 2019-03-08 06:
    -      1 2019-03-08 08:
    -    425 2019-03-08 09:
    -    432 2019-03-08 10:
    -    717 2019-03-08 11:
    -     59 2019-03-08 12:
    -
+ 11 2019-03-08 01:
+ 3 2019-03-08 02:
+ 1 2019-03-08 03:
+ 2 2019-03-08 04:
+ 1 2019-03-08 05:
+ 1 2019-03-08 06:
+ 1 2019-03-08 08:
+425 2019-03-08 09:
+432 2019-03-08 10:
+717 2019-03-08 11:
+ 59 2019-03-08 12:
+
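  • For the record, the same hourly roll-up can be done with a few lines of Python instead of the grep/sort/uniq pipeline (just a sketch; it reads an uncompressed log):

    import re
    from collections import Counter

    counts = Counter()
    with open('cocoon.log.2019-03-18') as log:
        for line in log:
            if 'Can not load requested doc' in line:
                match = re.match(r'(\d{4}-\d{2}-\d{2} \d{2}):', line)
                if match:
                    counts[match.group(1)] += 1

    for hour, count in sorted(counts.items()):
        print(f'{count:7d} {hour}:')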
  • -

    2019-03-19

    -

    2019-04-02

    @@ -195,94 +195,83 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace

    2019-04-03

    +
  • First I need to extract the ones that are unique from their list compared to our existing one:

    $ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2019-04-03-orcid-ids.txt
    -
    +
  • + - +
  • We currently have 1177 unique ORCID identifiers, and this brings our total to 1237!

  • + +
  • Next I will resolve all their names using my resolve-orcids.py script:

    $ ./resolve-orcids.py -i /tmp/2019-04-03-orcid-ids.txt -o 2019-04-03-orcid-ids.txt -d
    -
    +
  • - +
  • After that I added the XML formatting, formatted the file with tidy, and sorted the names in vim

  • + +
  • One user’s name has changed so I will update those using my fix-metadata-values.py script:

    $ ./fix-metadata-values.py -i 2019-04-03-update-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -m 240 -t correct -d
    -
    +
  • - +
  • I created a pull request and merged the changes to the 5_x-prod branch (#417)

  • + +
  • A few days ago I noticed some weird update process for the statistics-2018 Solr core and I see it’s still going:

    2019-04-03 16:34:02,262 INFO  org.dspace.statistics.SolrLogger @ Updating : 1754500/21701 docs in http://localhost:8081/solr//statistics-2018
    -
    +
  • - +
  • Interestingly, there are 5666 occurrences, and they are mostly for the 2018 core:

    $ grep 'org.dspace.statistics.SolrLogger @ Updating' /home/cgspace.cgiar.org/log/dspace.log.2019-04-03 | awk '{print $11}' | sort | uniq -c
    -      1 
    -      3 http://localhost:8081/solr//statistics-2017
    -   5662 http://localhost:8081/solr//statistics-2018
    -
+ 1
+ 3 http://localhost:8081/solr//statistics-2017
+5662 http://localhost:8081/solr//statistics-2018
+
  • -

    2019-04-05

    + +
  • I see there are lots of PostgreSQL connections:

    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
    -      5 dspaceApi
    -     10 dspaceCli
    -    250 dspaceWeb
    -
+ 5 dspaceApi
+ 10 dspaceCli
+250 dspaceWeb
+
  • - +
  • I still see those weird messages about updating the statistics-2018 Solr core:

    2019-04-05 21:06:53,770 INFO  org.dspace.statistics.SolrLogger @ Updating : 2444600/21697 docs in http://localhost:8081/solr//statistics-2018
    -
    +
  • -

    CPU usage week

    +
  • The load was lower on the server after reboot, but Solr didn’t come back up properly according to the Solr Admin UI:

    statistics-2017: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher 
    -
    +
  • + -

    2019-04-06

    @@ -295,63 +284,58 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
  • I tweeted the item and I assume this will link the Handle with the DOI in the system
  • Twenty minutes later I see the same Altmetric score (9) on CGSpace
  • Linode sent an alert that there was high CPU usage this morning on CGSpace (linode18) and these were the top IPs in the webserver access logs around the time:

    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "06/Apr/2019:(06|07|08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -    222 18.195.78.144
    -    245 207.46.13.58
    -    303 207.46.13.194
    -    328 66.249.79.33
    -    564 207.46.13.210
    -    566 66.249.79.62
    -    575 40.77.167.66
    -   1803 66.249.79.59
    -   2834 2a01:4f8:140:3192::2
    -   9623 45.5.184.72
    +222 18.195.78.144
    +245 207.46.13.58
    +303 207.46.13.194
    +328 66.249.79.33
    +564 207.46.13.210
    +566 66.249.79.62
    +575 40.77.167.66
    +1803 66.249.79.59
    +2834 2a01:4f8:140:3192::2
    +9623 45.5.184.72
     # zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "06/Apr/2019:(06|07|08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -     31 66.249.79.62
    -     41 207.46.13.210
    -     42 40.77.167.66
    -     54 42.113.50.219
    -    132 66.249.79.59
    -    785 2001:41d0:d:1990::
    -   1164 45.5.184.72
    -   2014 50.116.102.77
    -   4267 45.5.186.2
    -   4893 205.186.128.185
    -
+ 31 66.249.79.62
+ 41 207.46.13.210
+ 42 40.77.167.66
+ 54 42.113.50.219
+132 66.249.79.59
+785 2001:41d0:d:1990::
+1164 45.5.184.72
+2014 50.116.102.77
+4267 45.5.186.2
+4893 205.186.128.185
+
  • - +
  • 45.5.184.72 is in Colombia so it’s probably CIAT, and I see they are indeed trying to crawl the Discover pages on CIAT’s datasets collection:

    GET /handle/10568/72970/discover?filtertype_0=type&filtertype_1=author&filter_relational_operator_1=contains&filter_relational_operator_0=equals&filter_1=&filter_0=Dataset&filtertype=dateIssued&filter_relational_operator=equals&filter=2014
    -
    +
  • - +
  • Their user agent is the one I added to the badbots list in nginx last week: “GuzzleHttp/6.3.3 curl/7.47.0 PHP/7.0.30-0ubuntu0.16.04.1”

  • + +
  • They made 22,000 requests to Discover on this collection today alone (and it’s only 11AM):

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "06/Apr/2019" | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c 
    -  22077 /handle/10568/72970/discover
    -
+22077 /handle/10568/72970/discover
+
  • - +
  • Yesterday they made 43,000 requests and we actually blocked most of them:

    # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "05/Apr/2019" | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c 
    -  43631 /handle/10568/72970/discover
    +43631 /handle/10568/72970/discover
     # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "05/Apr/2019" | grep 45.5.184.72 | grep -E '/handle/[0-9]+/[0-9]+/discover' | awk '{print $9}' | sort | uniq -c 
    -    142 200
    -  43489 503
    -
+142 200
+43489 503
+
  • - - + +
  • I ended up using this GREL expression to copy all values to a new column:

    if(cell.recon.matched, cell.recon.match.name, value)
    -
    +
  • + - +
  • See the OpenRefine variables documentation for more notes about the recon object

  • + +
  • I also noticed a handful of errors in our current list of affiliations so I corrected them:

    $ ./fix-metadata-values.py -i 2019-04-08-fix-13-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -d
    -
    +
  • - +
  • We should create a new list of affiliations to update our controlled vocabulary again

  • + +
  • I dumped a list of the top 1500 affiliations:

    dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 211 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-04-08-top-1500-affiliations.csv WITH CSV HEADER;
     COPY 1500
    -
    +
  • - +
  • Fix a few more messed up affiliations that have return characters in them (use Ctrl-V Ctrl-M to re-create control character):

    dspace=# UPDATE metadatavalue SET text_value='International Institute for Environment and Development' WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE 'International Institute^M%';
     dspace=# UPDATE metadatavalue SET text_value='Kenya Agriculture and Livestock Research Organization' WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE 'Kenya Agricultural  and Livestock  Research^M%';
    -
    +
  • - +
  • I noticed a bunch of subjects and affiliations that use stylized apostrophes so I will export those and then batch update them:

    dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE '%’%') to /tmp/2019-04-08-affiliations-apostrophes.csv WITH CSV HEADER;
     COPY 60
     dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 57 AND text_value LIKE '%’%') to /tmp/2019-04-08-subject-apostrophes.csv WITH CSV HEADER;
     COPY 20
    -
    +
  • - +
  • I cleaned them up in OpenRefine and then applied the fixes on CGSpace and DSpace Test:

    $ ./fix-metadata-values.py -i /tmp/2019-04-08-fix-60-affiliations-apostrophes.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -d
     $ ./fix-metadata-values.py -i /tmp/2019-04-08-fix-20-subject-apostrophes.csv -db dspace -u dspace -p 'fuuu' -f dc.subject -m 57 -t correct -d
    -
    +
  • - + +
  • I looked at PostgreSQL and see shitloads of connections there:

    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
    -      5 dspaceApi
    -      7 dspaceCli
    -    250 dspaceWeb
    -
+5 dspaceApi
+7 dspaceCli
+250 dspaceWeb
+
  • + - +
  • On a related note I see connection pool errors in the DSpace log:

    2019-04-08 19:01:10,472 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error - 
     org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-319] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
    -
    +
  • -

    CPU usage week

    + +
  • The web server logs are not very busy:

    # zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E "08/Apr/2019:(17|18|19)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -    124 40.77.167.135
    -    135 95.108.181.88
    -    139 157.55.39.206
    -    190 66.249.79.133
    -    202 45.5.186.2
    -    284 207.46.13.95
    -    359 18.196.196.108
    -    457 157.55.39.164
    -    457 40.77.167.132
    -   3822 45.5.184.72
    +124 40.77.167.135
    +135 95.108.181.88
    +139 157.55.39.206
    +190 66.249.79.133
    +202 45.5.186.2
    +284 207.46.13.95
    +359 18.196.196.108
    +457 157.55.39.164
    +457 40.77.167.132
    +3822 45.5.184.72
     # zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "08/Apr/2019:(17|18|19)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -      5 129.0.79.206
    -      5 41.205.240.21
    -      7 207.46.13.95
    -      7 66.249.79.133
    -      7 66.249.79.135
    -      7 95.108.181.88
    -      8 40.77.167.111
    -     19 157.55.39.164
    -     20 40.77.167.132
    -    370 51.254.16.223
    -
+ 5 129.0.79.206
+ 5 41.205.240.21
+ 7 207.46.13.95
+ 7 66.249.79.133
+ 7 66.249.79.135
+ 7 95.108.181.88
+ 8 40.77.167.111
+ 19 157.55.39.164
+ 20 40.77.167.132
+370 51.254.16.223
+
  • +

    2019-04-09

    + +
  • Here are the top IPs in the web server logs around that time:

    # zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "09/Apr/2019:(06|07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -     18 66.249.79.139
    -     21 157.55.39.160
    -     29 66.249.79.137
    -     38 66.249.79.135
    -     50 34.200.212.137
    -     54 66.249.79.133
    -    100 102.128.190.18
    -   1166 45.5.184.72
    -   4251 45.5.186.2
    -   4895 205.186.128.185
    + 18 66.249.79.139
    + 21 157.55.39.160
    + 29 66.249.79.137
    + 38 66.249.79.135
    + 50 34.200.212.137
    + 54 66.249.79.133
    +100 102.128.190.18
    +1166 45.5.184.72
    +4251 45.5.186.2
    +4895 205.186.128.185
     # zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E "09/Apr/2019:(06|07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -    200 144.48.242.108
    -    202 207.46.13.185
    -    206 18.194.46.84
    -    239 66.249.79.139
    -    246 18.196.196.108
    -    274 31.6.77.23
    -    289 66.249.79.137
    -    312 157.55.39.160
    -    441 66.249.79.135
    -    856 66.249.79.133
    -
+200 144.48.242.108
+202 207.46.13.185
+206 18.194.46.84
+239 66.249.79.139
+246 18.196.196.108
+274 31.6.77.23
+289 66.249.79.137
+312 157.55.39.160
+441 66.249.79.135
+856 66.249.79.133
+
  • - +
  • 45.5.186.2 is at CIAT in Colombia and I see they are mostly making requests to the REST API, but also to XMLUI with the following user agent:

    Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36
    -
    +
  • - +
  • Database connection usage looks fine:

    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
    -      5 dspaceApi
    -      7 dspaceCli
    -     11 dspaceWeb
    -
+ 5 dspaceApi
+ 7 dspaceCli
+ 11 dspaceWeb
+
  • -

    2019-04-10

    + +
  • Note that if you use HTTPS and specify a contact address in the API request you have less likelihood of being blocked

    $ http 'https://api.crossref.org/funders?query=mercator&mailto=me@cgiar.org'
    -
    +
  • - +
  • Otherwise, they provide the funder data in CSV and RDF format

  • + +
  • I did a quick test with the recent IITA records against reconcile-csv in OpenRefine and it matched a few, but the ones that didn’t match will need a human to go and do some manual checking and informed decision making…

  • + +
  • If I want to write a script for this I could use the Python habanero library:

    from habanero import Crossref
     cr = Crossref(mailto="me@cgiar.org")
     x = cr.funders(query = "mercator")
    -
    +
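  • Continuing that sketch, and assuming habanero returns the raw Crossref JSON as a dict, the funder names could be pulled out like this:

    from habanero import Crossref

    cr = Crossref(mailto="me@cgiar.org")
    x = cr.funders(query="mercator")
    # the Crossref API wraps results in a "message" object with an "items" list
    for funder in x['message']['items']:
        print(funder['name'])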
  • +

    2019-04-11

    @@ -849,16 +812,16 @@ x = cr.funders(query = "mercator")
  • I validated all the AGROVOC subjects against our latest list with reconcile-csv
  • About 720 of the 900 terms were matched, then I checked and fixed or deleted the rest manually
  • I captured a few general corrections and deletions for AGROVOC subjects while looking at IITA’s records, so I applied them to DSpace Test and CGSpace:

    $ ./fix-metadata-values.py -i /tmp/2019-04-11-fix-14-subjects.csv -db dspace -u dspace -p 'fuuu' -f dc.subject -m 57 -t correct -d
     $ ./delete-metadata-values.py -i /tmp/2019-04-11-delete-6-subjects.csv -db dspace -u dspace -p 'fuuu' -m 57 -f dc.subject -d
    -
    +
  • - + +
  • I cloned the handle column and then did a transform to get the IDs from the CGSpace REST API:

    import json
     import re
    @@ -948,23 +910,24 @@ data = json.load(res)
     item_id = data['id']
     
     return item_id
    -
    +
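  • As a stand-alone sketch of the same transform (the OpenRefine cell above only keeps the final return value), resolving a Handle to its internal item ID via the REST API might look like this:

    import json
    import re
    import urllib.request

    def handle_to_item_id(handle_url):
        # for example https://cgspace.cgiar.org/handle/10568/72970 -> 10568/72970
        handle = re.sub(r'^https?://[^/]+/handle/', '', handle_url)
        res = urllib.request.urlopen('https://cgspace.cgiar.org/rest/handle/' + handle)
        data = json.load(res)
        return data['id']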
  • + - + +
  • I ran a full Discovery indexing on CGSpace because I didn’t do it after all the metadata updates last week:

    $ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
     real    82m45.324s
     user    7m33.446s
     sys     2m13.463s
    -
    +
  • +

    2019-04-16

    @@ -1122,8 +1085,8 @@ sys 2m13.463s + +
  • Regarding the other Linode issue about speed, I did a test with iperf between linode18 and linode19:

    # iperf -s
     ------------------------------------------------------------
    @@ -1139,16 +1102,17 @@ TCP window size: 85.0 KByte (default)
     [ ID] Interval       Transfer     Bandwidth
     [  5]  0.0-10.2 sec   172 MBytes   142 Mbits/sec
     [  4]  0.0-10.5 sec   202 MBytes   162 Mbits/sec
    -
    +
  • - + +
  • I want to get rid of this annoying warning that is constantly in our DSpace logs:

    2019-04-08 19:02:31,770 WARN  org.dspace.xoai.services.impl.xoai.DSpaceRepositoryConfiguration @ { OAI 2.0 :: DSpace } Not able to retrieve the dspace.oai.url property from oai.cfg. Falling back to request address
    -
    +
  • - +
  • Apparently it happens once per request, which can be at least 1,500 times per day according to the DSpace logs on CGSpace (linode18):

    $ grep -c 'Falling back to request address' dspace.log.2019-04-20
     dspace.log.2019-04-20:1515
    -
    +
  • - - + +
  • He says he’s getting HTTP 401 errors when trying to search for CPWF subject terms, which I can reproduce:

    $ curl -f -H "accept: application/json" -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": "en_US"}'
     curl: (22) The requested URL returned error: 401
    -
    +
  • + - + +
  • The breakdown of text_lang fields used in those items is 942:

    dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT' AND text_lang='en_US';
    - count 
    +count 
     -------
    -   376
    +376
     (1 row)
     
     dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT' AND text_lang='';
    - count 
    +count 
     -------
    -   149
    +149
     (1 row)
     
     dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT' AND text_lang IS NULL;
    - count 
    +count 
     -------
    -   417
    +417
     (1 row)
    -
    +
  • + - +
  • I see that the HTTP 401 issue seems to be a bug due to an item that the user doesn’t have permission to access… from the DSpace log:

    2019-04-24 08:11:51,129 INFO  org.dspace.rest.ItemsResource @ Looking for item with metadata(key=cg.subject.cpwf,value=WATER MANAGEMENT, language=en_US).
     2019-04-24 08:11:51,231 INFO  org.dspace.usage.LoggerUsageEventListener @ anonymous::view_item:handle=10568/72448
     2019-04-24 08:11:51,238 INFO  org.dspace.usage.LoggerUsageEventListener @ anonymous::view_item:handle=10568/72491
     2019-04-24 08:11:51,243 INFO  org.dspace.usage.LoggerUsageEventListener @ anonymous::view_item:handle=10568/75703
     2019-04-24 08:11:51,252 ERROR org.dspace.rest.ItemsResource @ User(anonymous) has not permission to read item!
    -
    +
  • - +
  • Nevertheless, if I request using the null language I get 1020 results, plus 179 for a blank language attribute:

    $ curl -s -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": null}' | jq length
     1020
     $ curl -s -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": ""}' | jq length
     179
    -
    +
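  • A quick requests-based sketch for comparing the three language variants side by side (same endpoint and payload as the curl commands above):

    import requests

    url = 'https://dspacetest.cgiar.org/rest/items/find-by-metadata-field'
    for language in ['en_US', '', None]:
        payload = {'key': 'cg.subject.cpwf', 'value': 'WATER MANAGEMENT', 'language': language}
        res = requests.post(url, json=payload)
        # the en_US variant currently fails with HTTP 401 because of the non-archived item
        count = len(res.json()) if res.ok else None
        print(language, res.status_code, count)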
  • - +
  • This is weird because I see 942–1156 items with “WATER MANAGEMENT” (depending on wildcard matching for errors in subject spelling):

    dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT';
    - count 
    +count 
     -------
    -   942
    +942
     (1 row)
     
     dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value LIKE '%WATER MANAGEMENT%';
    - count 
    +count 
     -------
    -  1156
    +1156
     (1 row)
    -
    +
  • -

    2019-04-25

    @@ -1337,119 +1291,112 @@ dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AN
  • Also, it would be nice if we could include the item title in the shared link
  • I created an issue on GitHub to track this (#419)
  • I tested the REST API after logging in with my super admin account and I was able to get results for the problematic query:

    $ curl -f -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/login" -d '{"email":"example@me.com","password":"fuuuuu"}'
     $ curl -f -H "Content-Type: application/json" -H "rest-dspace-token: b43d41a6-5ac1-455d-b49a-616b8debc25b" -X GET "https://dspacetest.cgiar.org/rest/status"
     $ curl -f -H "rest-dspace-token: b43d41a6-5ac1-455d-b49a-616b8debc25b" -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": "en_US"}'
    -
    +
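  • For completeness, the session token can be invalidated afterwards (I believe the REST API exposes a logout endpoint that takes the same token header):

    $ curl -f -H "rest-dspace-token: b43d41a6-5ac1-455d-b49a-616b8debc25b" -X POST "https://dspacetest.cgiar.org/rest/logout"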
  • I created a normal user for Carlos to try as an unprivileged user:

    $ dspace user --add --givenname Carlos --surname Tejo --email blah@blah.com --password 'ddmmdd'
    -
    +
  • But still I get the HTTP 401 and I have no idea which item is causing it

  • I enabled more verbose logging in ItemsResource.java and now I can at least see the item ID that causes the failure…

    +
  • The item is not even in the archive, but somehow it is discoverable

    dspace=# SELECT * FROM item WHERE item_id=74648;
    - item_id | submitter_id | in_archive | withdrawn |       last_modified        | owning_collection | discoverable
    +item_id | submitter_id | in_archive | withdrawn |       last_modified        | owning_collection | discoverable
     ---------+--------------+------------+-----------+----------------------------+-------------------+--------------
    -   74648 |          113 | f          | f         | 2016-03-30 09:00:52.131+00 |                   | t
    +74648 |          113 | f          | f         | 2016-03-30 09:00:52.131+00 |                   | t
     (1 row)
    -
    +
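  • Since it is not in the archive, the item is probably stuck in a submission state somewhere; checking the workflow and workspace tables should show where (a sketch):

    dspace=# SELECT * FROM workflowitem WHERE item_id=74648;
    dspace=# SELECT * FROM workspaceitem WHERE item_id=74648;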

    2019-04-26

    +
  • Export a list of authors for Peter to look through:

    dspacetest=# # \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-04-26-all-authors.csv with csv header;
     COPY 65752
    -
    +

    2019-04-28

    +
  • I made the item private in the UI and then I see in the UI and PostgreSQL that it is no longer discoverable:

    dspace=# SELECT * FROM item WHERE item_id=74648;
    - item_id | submitter_id | in_archive | withdrawn |       last_modified        | owning_collection | discoverable 
    +item_id | submitter_id | in_archive | withdrawn |       last_modified        | owning_collection | discoverable 
     ---------+--------------+------------+-----------+----------------------------+-------------------+--------------
    -   74648 |          113 | f          | f         | 2019-04-28 08:48:52.114-07 |                   | f
    +74648 |          113 | f          | f         | 2019-04-28 08:48:52.114-07 |                   | f
     (1 row)
    -
    +
  • And I tried the curl command from above again, but I still get the HTTP 401 and the same error in the DSpace log:

    2019-04-28 08:53:07,170 ERROR org.dspace.rest.ItemsResource @ User(anonymous) has not permission to read item(id=74648)!
    -
    +

    2019-04-30

    + +
  • Delete and re-create Podman container for dspacedb after pulling a new PostgreSQL container:

    $ podman run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
    -
    +
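  • For reference, the full delete-and-recreate sequence is roughly the following (a sketch; the named dspacedb_data volume keeps the data across containers):

    $ podman stop dspacedb
    $ podman rm dspacedb
    $ podman pull postgres:9.6-alpine
    $ podman run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine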
  • Carlos from LandPortal asked if I could export CGSpace in a machine-readable format so I think I’ll try to do a CSV

    +
  • In order to make it easier for him to understand the CSV I will normalize the text languages (minus the provenance field) on my local development instance before exporting:

    dspace=# SELECT DISTINCT text_lang, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id != 28 GROUP BY text_lang;
    - text_lang |  count
    +text_lang |  count
     -----------+---------
    -           |  358647
    - *         |      11
    - E.        |       1
    - en        |    1635
    - en_US     |  602312
    - es        |      12
    - es_ES     |       2
    - ethnob    |       1
    - fr        |       2
    - spa       |       2
    -           | 1074345
    +   |  358647
    +*         |      11
    +E.        |       1
    +en        |    1635
    +en_US     |  602312
    +es        |      12
    +es_ES     |       2
    +ethnob    |       1
    +fr        |       2
    +spa       |       2
    +   | 1074345
     (11 rows)
     dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('ethnob', 'en', '*', 'E.', '');
     UPDATE 360295
    @@ -1457,11 +1404,12 @@ dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND
     UPDATE 1074345
     dspace=# UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('es', 'spa');
     UPDATE 14
    -
    +
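  • Re-running the DISTINCT query afterwards is an easy sanity check; it should now show only the normalized values (mostly en_US, plus es_ES and fr):

    dspace=# SELECT DISTINCT text_lang, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id != 28 GROUP BY text_lang;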
  • The item seems to be in a pre-submitted state, so I tried to delete it from there:

    dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
     DELETE 1
    -
    +
  • Then I ran the following SQL:

    dspace=# DELETE FROM metadatavalue WHERE resource_id=74648;
     dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
     dspace=# DELETE FROM item WHERE item_id=74648;
    -
    - - + +
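  • When deleting rows by hand like that it might be safer to wrap the statements in a transaction, so they can be rolled back if the affected row counts look wrong (a sketch):

    dspace=# BEGIN;
    dspace=# DELETE FROM metadatavalue WHERE resource_id=74648;
    dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
    dspace=# DELETE FROM item WHERE item_id=74648;
    dspace=# COMMIT;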
  • Now the item is (hopefully) really gone and I can continue troubleshooting the issue with the REST API’s /items/find-by-metadata-field endpoint

    + +
  • The DSpace log shows the item ID (because I modified the error text):

    2019-05-01 11:41:11,069 ERROR org.dspace.rest.ItemsResource @ User(anonymous) has not permission to read item(id=77708)!
    -
    +
  • I told them to use the REST API like (where 1179 is the id of the RTB journal articles collection):

    https://cgspace.cgiar.org/rest/collections/1179/items?limit=812&expand=metadata
    -
    +
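  • If the response is too large in one go, the same endpoint can be paged with limit and offset (a sketch; I believe both parameters are supported on the items listing):

    https://cgspace.cgiar.org/rest/collections/1179/items?limit=100&offset=0&expand=metadata
    https://cgspace.cgiar.org/rest/collections/1179/items?limit=100&offset=100&expand=metadata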

    2019-05-03

    +
  • I checked the dspace test-email script on CGSpace and the emails are indeed failing:

    $ dspace test-email
     
     About to send test email:
    - - To: woohoo@cgiar.org
    - - Subject: DSpace test email
    - - Server: smtp.office365.com
    +- To: woohoo@cgiar.org
    +- Subject: DSpace test email
    +- Server: smtp.office365.com
     
     Error sending email:
    - - Error: javax.mail.AuthenticationFailedException
    +- Error: javax.mail.AuthenticationFailedException
     
     Please see the DSpace documentation for assistance.
    -
    +
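  • The credentials the test uses come from the mail settings in dspace.cfg, so those are the first thing to check against the Office 365 account (a sketch of the relevant properties, with placeholder values):

    mail.server = smtp.office365.com
    mail.server.port = 587
    mail.server.username = woohoo@cgiar.org
    mail.server.password = CHANGEME
    mail.from.address = woohoo@cgiar.org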

    2019-05-05

    + + + diff --git a/docs/404.html b/docs/404.html index e0f0e1b6e..9b3e8c7da 100644 --- a/docs/404.html +++ b/docs/404.html @@ -14,7 +14,7 @@ - + diff --git a/docs/categories/index.html b/docs/categories/index.html index aabd0db80..6da371a43 100644 --- a/docs/categories/index.html +++ b/docs/categories/index.html @@ -15,7 +15,7 @@ - + @@ -108,15 +108,14 @@
  • Apparently if the item is in the workflowitem table it is submitted to a workflow
  • And if it is in the workspaceitem table it is in the pre-submitted state
  • The item seems to be in a pre-submitted state, so I tried to delete it from there:

    dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
     DELETE 1
    -
    +
  • - Read more → @@ -143,27 +142,27 @@ DELETE 1 -
  • There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today

  • I suspected that some might not be successful, because the stats show less, but today they were all HTTP 200!

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
     4432 200
  • In the last two weeks there have been 47,000 downloads of this same exact PDF by these three IP addresses

  • Apply country and region corrections and deletions on DSpace Test and CGSpace:

    $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
     $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
     $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
     $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
    -
    +
  • + Read more → @@ -218,27 +217,27 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace + +
  • The top IPs before, during, and after this latest alert tonight were:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "01/Feb/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      245 207.46.13.5
      332 54.70.40.11
      385 5.143.231.38
      405 207.46.13.173
      405 207.46.13.75
     1117 66.249.66.219
     1121 35.237.175.180
     1546 5.9.6.51
     2474 45.5.186.2
     5490 85.25.237.71
  • 85.25.237.71 is the “Linguee Bot” that I first saw last month

  • The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase

  • There were just over 3 million accesses in the nginx logs last month:

    # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
     3018243
    @@ -246,7 +245,8 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
     real    0m19.873s
     user    0m22.203s
     sys     0m1.979s
    -
    +
  • + Read more → @@ -268,21 +268,22 @@ sys 0m1.979s + +
  • I don’t see anything interesting in the web server logs around that time though:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
       92 40.77.167.4
       99 210.7.29.100
      120 38.126.157.45
      177 35.237.175.180
      177 40.77.167.32
      216 66.249.75.219
      225 18.203.76.93
      261 46.101.86.248
      357 207.46.13.1
      903 54.70.40.11
  • + Read more → @@ -411,21 +412,24 @@ sys 0m1.979s

    2018-08-01

    +
  • DSpace Test had crashed at some point yesterday morning and I see the following in dmesg:

    [Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
     [Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB
     [Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    -
    +
  • - Read more → diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html index 4a666f766..24437e6f9 100644 --- a/docs/categories/notes/index.html +++ b/docs/categories/notes/index.html @@ -15,7 +15,7 @@ - + diff --git a/docs/categories/page/2/index.html b/docs/categories/page/2/index.html index 9c3181c08..88cce1be9 100644 --- a/docs/categories/page/2/index.html +++ b/docs/categories/page/2/index.html @@ -15,7 +15,7 @@ - + @@ -101,18 +101,16 @@

    2018-07-01

    +
  • I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:

    $ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
    -
    +
  • During the mvn package stage on the 5.8 branch I kept getting issues with java running out of memory:

    There is insufficient memory for the Java Runtime Environment to continue.
    -
    +
  • + Read more → @@ -139,23 +137,23 @@
  • There seems to be a problem with the CUA and L&R versions in pom.xml because they are using SNAPSHOT and it doesn’t build
  • I added the new CCAFS Phase II Project Tag PII-FP1_PACCA2 and merged it into the 5_x-prod branch (#379)
  • I proofed and tested the ILRI author corrections that Peter sent back to me this week:

    $ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
    -
    +
  • I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in March, 2018

  • Time to index ~70,000 items on CGSpace:

    $ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b                                  
     
     real    74m42.646s
     user    8m5.056s
     sys     2m7.289s
    -
    +
  • + Read more → @@ -279,26 +277,25 @@ sys 2m7.289s
  • I didn’t get any load alerts from Linode and the REST and XMLUI logs don’t show anything out of the ordinary
  • The nginx logs show HTTP 200s until 02/Jan/2018:11:27:17 +0000 when Uptime Robot got an HTTP 500
  • In dspace.log around that time I see many errors like “Client closed the connection before file download was complete”
  • And just before that I see this:

    Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
    -
    +
  • Ah hah! So the pool was actually empty!

  • I need to increase that, let’s try to bump it up from 50 to 75

  • After that one client got an HTTP 499 but then the rest were HTTP 200, so I don’t know what the hell Uptime Robot saw

  • I notice this error quite a few times in dspace.log:

    2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
     org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered " "]" "] "" at line 1, column 32.
    -
    +
  • And there are many of these errors every day for the past month:

    $ grep -c "Error while searching for sidebar facets" dspace.log.*
     dspace.log.2017-11-21:4
    @@ -344,10 +341,9 @@ dspace.log.2017-12-30:89
     dspace.log.2017-12-31:53
     dspace.log.2018-01-01:45
     dspace.log.2018-01-02:34
    -
    +
  • - Read more → @@ -400,20 +396,18 @@ dspace.log.2018-01-02:34

    2017-11-02

    +
  • Today there have been no hits by CORE and no alerts from Linode (coincidence?)

    # grep -c "CORE" /var/log/nginx/access.log
     0
    -
    +
  • Generate list of authors on CGSpace for Peter to go through and correct:

    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
     COPY 54701
    -
    +
  • + Read more → @@ -434,15 +428,14 @@ COPY 54701

    2017-10-01

    +
  • Peter emailed to point out that many items in the ILRI archive collection have multiple handles:

    http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
    -
    +
  • - Read more → diff --git a/docs/categories/page/3/index.html b/docs/categories/page/3/index.html index 355e26fa3..1b0a6833b 100644 --- a/docs/categories/page/3/index.html +++ b/docs/categories/page/3/index.html @@ -15,7 +15,7 @@ - + @@ -261,11 +261,12 @@ + +
  • Testing the CMYK patch on a collection with 650 items:

    $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt
    -
    +
  • + Read more → @@ -300,12 +301,13 @@
  • Also, need to consider talking to Atmire about hiring them to bring ORCiD metadata to REST / OAI
  • Filed an issue on DSpace issue tracker for the filter-media bug that causes it to process JPGs even when limiting to the PDF thumbnail plugin: DS-3516
  • Discovered that the ImageMagic filter-media plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK
  • Interestingly, it seems DSpace 4.x’s thumbnails were sRGB, but forcing regeneration using DSpace 5.x’s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 10568/51999):

    $ identify ~/Desktop/alc_contrastes_desafios.jpg
     /Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
    -
    +
  • + Read more → @@ -326,23 +328,22 @@

    2017-02-07

    +
  • An item was mapped twice erroneously again, so I had to remove one of the mappings manually:

    dspace=# select * from collection2item where item_id = '80278';
    -  id   | collection_id | item_id
    +id   | collection_id | item_id
     -------+---------------+---------
    - 92551 |           313 |   80278
    - 92550 |           313 |   80278
    - 90774 |          1051 |   80278
    +92551 |           313 |   80278
    +92550 |           313 |   80278
    +90774 |          1051 |   80278
     (3 rows)
     dspace=# delete from collection2item where id = 92551 and item_id = 80278;
     DELETE 1
    -
    +
  • - Read more → diff --git a/docs/categories/page/4/index.html b/docs/categories/page/4/index.html index 1edb196b1..83d302248 100644 --- a/docs/categories/page/4/index.html +++ b/docs/categories/page/4/index.html @@ -15,7 +15,7 @@ - + @@ -102,20 +102,21 @@ + +
  • While looking in the logs for errors, I see tons of warnings about Atmire MQM:

    2016-12-02 03:00:32,352 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail="dc.title", transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail="THUMBNAIL", transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
    -
    +
  • - Read more → @@ -168,11 +169,12 @@
  • ORCIDs only
  • ORCIDs plus normal authors
  • I exported a random item’s metadata as CSV, deleted all columns except id and collection, and made a new column called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:

    0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
    -
    +
  • + Read more → @@ -196,11 +198,12 @@
  • Discuss helping CCAFS with some batch tagging of ORCID IDs for their authors
  • Discuss how the migration of CGIAR’s Active Directory to a flat structure will break our LDAP groups in DSpace
  • We had been using DC=ILRI to determine whether a user was ILRI or not
  • It looks like we might be able to use OUs now, instead of DCs:

    $ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=org" -D "admigration1@cgiarad.org" -W "(sAMAccountName=admigration1)"
    -
    +
  • + Read more → @@ -226,13 +229,14 @@
  • Bootstrap is at 3.3.0 but upstream is at 3.3.7, and upgrading to anything beyond 3.3.1 breaks glyphicons and probably more
  • bower stuff is a dead end, waste of time, too many issues
  • Anything after Bootstrap 3.3.1 makes glyphicons disappear (HTTP 404 trying to access from incorrect path of fonts)
  • Start working on DSpace 5.1 → 5.5 port:

    $ git checkout -b 55new 5_x-prod
     $ git reset --hard ilri/5_x-prod
     $ git rebase -i dspace-5.5
    -
    +
  • + Read more → @@ -254,19 +258,18 @@ $ git rebase -i dspace-5.5 + +
  • I think this query should find and replace all authors that have “,” at the end of their names:

    dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
     UPDATE 95
     dspacetest=# select text_value from  metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
    - text_value
    +text_value
     ------------
     (0 rows)
    -
    +
  • - Read more → @@ -317,12 +320,13 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and + +
  • There are 3,000 IPs accessing the REST API in a 24-hour period!

    # awk '{print $1}' /var/log/nginx/rest.log  | uniq | wc -l
     3168
    -
    +
  • + Read more → diff --git a/docs/categories/page/5/index.html b/docs/categories/page/5/index.html index 2035f7181..992995cb8 100644 --- a/docs/categories/page/5/index.html +++ b/docs/categories/page/5/index.html @@ -15,7 +15,7 @@ - + @@ -156,15 +156,15 @@

    2015-12-02

    +
  • Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less space:

    # cd /home/dspacetest.cgiar.org/log
     # ls -lh dspace.log.2015-11-18*
     -rw-rw-r-- 1 tomcat7 tomcat7 2.0M Nov 18 23:59 dspace.log.2015-11-18
     -rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo
     -rw-rw-r-- 1 tomcat7 tomcat7 169K Nov 18 23:59 dspace.log.2015-11-18.xz
    -
    +
  • + Read more → @@ -187,12 +187,13 @@ + +
  • Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:

    $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
     78
    -
    +
  • + Read more → diff --git a/docs/cgiar-library-migration/index.html b/docs/cgiar-library-migration/index.html index 031992b4e..0b6ea633b 100644 --- a/docs/cgiar-library-migration/index.html +++ b/docs/cgiar-library-migration/index.html @@ -15,7 +15,7 @@ - + @@ -25,7 +25,7 @@ "@type": "BlogPosting", "headline": "CGIAR Library Migration", "url": "https:\/\/alanorth.github.io\/cgspace-notes\/cgiar-library-migration\/", - "wordCount": "1278", + "wordCount": "1285", "datePublished": "2017-09-18T16:38:35\x2b03:00", "dateModified": "2018-03-09T22:10:33\x2b02:00", "author": { @@ -121,8 +121,8 @@
  • SELECT * FROM pg_stat_activity; seems to show ~6 extra connections used by the command line tools during import
  • [x] Copy HTTPS certificate key pair from CGIAR Library server’s Tomcat keystore:

    $ keytool -list -keystore tomcat.keystore
     $ keytool -importkeystore -srckeystore tomcat.keystore -destkeystore library.cgiar.org.p12 -deststoretype PKCS12 -srcalias tomcat
    @@ -130,7 +130,8 @@ $ openssl pkcs12 -in library.cgiar.org.p12 -nokeys -out library.cgiar.org.crt.pe
     $ openssl pkcs12 -in library.cgiar.org.p12 -nodes -nocerts -out library.cgiar.org.key.pem
     $ wget https://certs.godaddy.com/repository/gdroot-g2.crt https://certs.godaddy.com/repository/gdig2.crt.pem
     $ cat library.cgiar.org.crt.pem gdig2.crt.pem > library.cgiar.org-chained.pem
    -
    +
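  • A standard openssl check (not part of the original notes) can confirm that the exported certificate and key actually match before deploying them, by comparing the modulus hashes:

    $ openssl x509 -noout -modulus -in library.cgiar.org.crt.pem | openssl md5
    $ openssl rsa -noout -modulus -in library.cgiar.org.key.pem | openssl md5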

    Migration Process

    @@ -155,16 +156,14 @@ $ dspace packager -d -a -t AIP -e aorth@mjanja.ch -i 10947/1 10947-1/10947-1.zip + +
  • [x] Add ingestion overrides to dspace.cfg before import:

    mets.dspaceAIP.ingest.crosswalk.METSRIGHTS = NIL
     mets.dspaceAIP.ingest.crosswalk.DSPACE-ROLES = NIL
    -
    +
  • [x] Import communities and collections, paying attention to options to skip missing parents and ignore handles:

    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx3072m -XX:-UseGCOverheadLimit -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
     $ export PATH=$PATH:/home/cgspace.cgiar.org/bin
    @@ -182,36 +181,37 @@ $ for item in 10947-2527/ITEM@10947-*; do dspace packager -r -f -u -t AIP -e aor
     $ dspace packager -s -t AIP -o ignoreHandle=false -e aorth@mjanja.ch -p 10568/83389 10947-1/10947-1.zip
     $ for collection in 10947-1/COLLECTION@10947-*; do dspace packager -s -o ignoreHandle=false -t AIP -e aorth@mjanja.ch -p 10947/1 $collection; done
     $ for item in 10947-1/ITEM@10947-*; do dspace packager -r -f -u -t AIP -e aorth@mjanja.ch $item; done
    -
    +

    This submits AIP hierarchies recursively (-r) and suppresses errors when an item’s parent collection hasn’t been created yet—for example, if the item is mapped. The large historic archive (10947/1) is created in several steps because it requires a lot of memory and often crashes.

    Create new subcommunities and collections for content we reorganized into new hierarchies from the original:

    + +
  • Import collection hierarchy first and then the items:

    $ dspace packager -r -t AIP -o ignoreHandle=false -e aorth@mjanja.ch -p 10568/83536 10568-93760/COLLECTION@10947-4651.zip
     $ for item in 10568-93760/ITEM@10947-465*; do dspace packager -r -f -u -t AIP -e aorth@mjanja.ch $item; done
    -
    +
  • Import items to collection individually in replace mode (-r) while explicitly preserving handles and ignoring parents:

    $ for item in 10568-93759/ITEM@10947-46*; do dspace packager -r -t AIP -o ignoreHandle=false -o ignoreParent=true -e aorth@mjanja.ch -p 10568/83538 $item; done
    -
    +

    Get the handles for the last few items from CGIAR Library that were created since we did the migration to DSpace Test in May:

    @@ -219,18 +219,16 @@ $ for item in 10568-93760/ITEM@10947-465*; do dspace packager -r -f -u -t AIP -e +
  • Export them from the CGIAR Library:

    # for handle in 10947/4658 10947/4659 10947/4660 10947/4661 10947/4665 10947/4664 10947/4666 10947/4669; do /usr/local/dspace/bin/dspace packager -d -a -t AIP -e m.marus@cgiar.org -i $handle ${handle}.zip; done
    -
    +
  • Import on CGSpace:

    $ for item in 10947-latest/*.zip; do dspace packager -r -u -t AIP -e aorth@mjanja.ch $item; done
    -
    +

    Post Migration

    @@ -239,8 +237,8 @@ $ for item in 10568-93760/ITEM@10947-465*; do dspace packager -r -f -u -t AIP -e
  • [x] Adjust CGSpace’s handle-server/config.dct to add the new prefix alongside our existing 10568, ie:

    "server_admins" = (
     "300:0.NA/10568"
    @@ -256,7 +254,8 @@ $ for item in 10568-93760/ITEM@10947-465*; do dspace packager -r -f -u -t AIP -e
     "300:0.NA/10568"
     "300:0.NA/10947"
     )
    -
    +

    I had regenerated the sitebndl.zip file on the CGIAR Library server and sent it to the Handle.net admins but they said that there were mismatches between the public and private keys, which I suspect is due to make-handle-config not being very flexible. After discussing our scenario with the Handle.net admins they said we actually don’t need to send an updated sitebndl.zip for this type of change, and the above config.dct edits are all that is required. I guess they just did something on their end by setting the authoritative IP address for the 10947 prefix to be the same as ours…

    @@ -269,13 +268,14 @@ $ for item in 10568-93760/ITEM@10947-465*; do dspace packager -r -f -u -t AIP -e
  • [x] Switch to Let’s Encrypt HTTPS certificates (after DNS is updated and server isn’t busy):

    $ sudo systemctl stop nginx
     $ /opt/certbot-auto certonly --standalone -d library.cgiar.org
     $ sudo systemctl start nginx
    -
    +
  • +

    Troubleshooting

    diff --git a/docs/index.html b/docs/index.html index 8e4f10b02..0340eb6ea 100644 --- a/docs/index.html +++ b/docs/index.html @@ -15,7 +15,7 @@ - + @@ -108,15 +108,14 @@
  • Apparently if the item is in the workflowitem table it is submitted to a workflow
  • And if it is in the workspaceitem table it is in the pre-submitted state
  • The item seems to be in a pre-submitted state, so I tried to delete it from there:

    dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
     DELETE 1
    -
    +
  • - Read more → @@ -143,27 +142,27 @@ DELETE 1 -
  • There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today

  • I suspected that some might not be successful, because the stats show less, but today they were all HTTP 200!

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
     4432 200
  • In the last two weeks there have been 47,000 downloads of this same exact PDF by these three IP addresses

  • Apply country and region corrections and deletions on DSpace Test and CGSpace:

    $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
     $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
     $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
     $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
    -
    +
  • + Read more → @@ -218,27 +217,27 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace + +
  • The top IPs before, during, and after this latest alert tonight were:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "01/Feb/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      245 207.46.13.5
      332 54.70.40.11
      385 5.143.231.38
      405 207.46.13.173
      405 207.46.13.75
     1117 66.249.66.219
     1121 35.237.175.180
     1546 5.9.6.51
     2474 45.5.186.2
     5490 85.25.237.71
  • 85.25.237.71 is the “Linguee Bot” that I first saw last month

  • The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase

  • There were just over 3 million accesses in the nginx logs last month:

    # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
     3018243
    @@ -246,7 +245,8 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
     real    0m19.873s
     user    0m22.203s
     sys     0m1.979s
    -
    +
  • + Read more → @@ -268,21 +268,22 @@ sys 0m1.979s + +
  • I don’t see anything interesting in the web server logs around that time though:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
       92 40.77.167.4
       99 210.7.29.100
      120 38.126.157.45
      177 35.237.175.180
      177 40.77.167.32
      216 66.249.75.219
      225 18.203.76.93
      261 46.101.86.248
      357 207.46.13.1
      903 54.70.40.11
  • + Read more → @@ -411,21 +412,24 @@ sys 0m1.979s

    2018-08-01

    +
  • DSpace Test had crashed at some point yesterday morning and I see the following in dmesg:

    [Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
     [Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB
     [Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    -
    +
  • - Read more → diff --git a/docs/index.xml b/docs/index.xml index 06625a155..0a356faf4 100644 --- a/docs/index.xml +++ b/docs/index.xml @@ -27,15 +27,14 @@ <li>Apparently if the item is in the <code>workflowitem</code> table it is submitted to a workflow</li> <li>And if it is in the <code>workspaceitem</code> table it is in the pre-submitted state</li> </ul></li> -<li>The item seems to be in a pre-submitted state, so I tried to delete it from there:</li> -</ul> + +<li><p>The item seems to be in a pre-submitted state, so I tried to delete it from there:</p> <pre><code>dspace=# DELETE FROM workspaceitem WHERE item_id=74648; DELETE 1 -</code></pre> +</code></pre></li> -<ul> -<li>But after this I tried to delete the item from the XMLUI and it is <em>still</em> present&hellip;</li> +<li><p>But after this I tried to delete the item from the XMLUI and it is <em>still</em> present&hellip;</p></li> </ul> @@ -53,27 +52,27 @@ DELETE 1 <ul> <li>They asked if we had plans to enable RDF support in CGSpace</li> </ul></li> -<li>There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today + +<li><p>There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today</p> <ul> -<li>I suspected that some might not be successful, because the stats show less, but today they were all HTTP 200!</li> -</ul></li> -</ul> +<li><p>I suspected that some might not be successful, because the stats show less, but today they were all HTTP 200!</p> <pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5 - 4432 200 -</code></pre> +4432 200 +</code></pre></li> +</ul></li> -<ul> -<li>In the last two weeks there have been 47,000 downloads of this <em>same exact PDF</em> by these three IP addresses</li> -<li>Apply country and region corrections and deletions on DSpace Test and CGSpace:</li> -</ul> +<li><p>In the last two weeks there have been 47,000 downloads of this <em>same exact PDF</em> by these three IP addresses</p></li> + +<li><p>Apply country and region corrections and deletions on DSpace Test and CGSpace:</p> <pre><code>$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d -</code></pre> +</code></pre></li> +</ul> @@ -110,27 +109,27 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace <ul> <li>Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!</li> -<li>The top IPs before, during, and after this latest alert tonight were:</li> -</ul> + +<li><p>The top IPs before, during, and after this latest alert tonight were:</p> <pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;01/Feb/2019:(17|18|19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 - 245 207.46.13.5 
- 332 54.70.40.11 - 385 5.143.231.38 - 405 207.46.13.173 - 405 207.46.13.75 - 1117 66.249.66.219 - 1121 35.237.175.180 - 1546 5.9.6.51 - 2474 45.5.186.2 - 5490 85.25.237.71 -</code></pre> +245 207.46.13.5 +332 54.70.40.11 +385 5.143.231.38 +405 207.46.13.173 +405 207.46.13.75 +1117 66.249.66.219 +1121 35.237.175.180 +1546 5.9.6.51 +2474 45.5.186.2 +5490 85.25.237.71 +</code></pre></li> -<ul> -<li><code>85.25.237.71</code> is the &ldquo;Linguee Bot&rdquo; that I first saw last month</li> -<li>The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase</li> -<li>There were just over 3 million accesses in the nginx logs last month:</li> -</ul> +<li><p><code>85.25.237.71</code> is the &ldquo;Linguee Bot&rdquo; that I first saw last month</p></li> + +<li><p>The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase</p></li> + +<li><p>There were just over 3 million accesses in the nginx logs last month:</p> <pre><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2019&quot; 3018243 @@ -138,7 +137,8 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace real 0m19.873s user 0m22.203s sys 0m1.979s -</code></pre> +</code></pre></li> +</ul> @@ -151,21 +151,22 @@ sys 0m1.979s <ul> <li>Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning</li> -<li>I don&rsquo;t see anything interesting in the web server logs around that time though:</li> -</ul> + +<li><p>I don&rsquo;t see anything interesting in the web server logs around that time though:</p> <pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 - 92 40.77.167.4 - 99 210.7.29.100 - 120 38.126.157.45 - 177 35.237.175.180 - 177 40.77.167.32 - 216 66.249.75.219 - 225 18.203.76.93 - 261 46.101.86.248 - 357 207.46.13.1 - 903 54.70.40.11 -</code></pre> + 92 40.77.167.4 + 99 210.7.29.100 +120 38.126.157.45 +177 35.237.175.180 +177 40.77.167.32 +216 66.249.75.219 +225 18.203.76.93 +261 46.101.86.248 +357 207.46.13.1 +903 54.70.40.11 +</code></pre></li> +</ul> @@ -249,21 +250,24 @@ sys 0m1.979s <h2 id="2018-08-01">2018-08-01</h2> <ul> -<li>DSpace Test had crashed at some point yesterday morning and I see the following in <code>dmesg</code>:</li> -</ul> +<li><p>DSpace Test had crashed at some point yesterday morning and I see the following in <code>dmesg</code>:</p> <pre><code>[Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child [Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB [Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB -</code></pre> +</code></pre></li> -<ul> -<li>Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight</li> -<li>From the DSpace log I see that eventually Solr stopped responding, so I guess the <code>java</code> process that was OOM killed above was Tomcat&rsquo;s</li> -<li>I&rsquo;m not sure why Tomcat didn&rsquo;t crash with an OutOfMemoryError&hellip;</li> -<li>Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core</li> -<li>The server only has 8GB of RAM so we&rsquo;ll 
eventually need to upgrade to a larger one because we&rsquo;ll start starving the OS, PostgreSQL, and command line batch processes</li> -<li>I ran all system updates on DSpace Test and rebooted it</li> +<li><p>Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight</p></li> + +<li><p>From the DSpace log I see that eventually Solr stopped responding, so I guess the <code>java</code> process that was OOM killed above was Tomcat&rsquo;s</p></li> + +<li><p>I&rsquo;m not sure why Tomcat didn&rsquo;t crash with an OutOfMemoryError&hellip;</p></li> + +<li><p>Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core</p></li> + +<li><p>The server only has 8GB of RAM so we&rsquo;ll eventually need to upgrade to a larger one because we&rsquo;ll start starving the OS, PostgreSQL, and command line batch processes</p></li> + +<li><p>I ran all system updates on DSpace Test and rebooted it</p></li> </ul> @@ -276,18 +280,16 @@ sys 0m1.979s <h2 id="2018-07-01">2018-07-01</h2> <ul> -<li>I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:</li> -</ul> +<li><p>I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:</p> <pre><code>$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace -</code></pre> +</code></pre></li> -<ul> -<li>During the <code>mvn package</code> stage on the 5.8 branch I kept getting issues with java running out of memory:</li> -</ul> +<li><p>During the <code>mvn package</code> stage on the 5.8 branch I kept getting issues with java running out of memory:</p> <pre><code>There is insufficient memory for the Java Runtime Environment to continue. 
-</code></pre> +</code></pre></li> +</ul> @@ -305,23 +307,23 @@ sys 0m1.979s <li>There seems to be a problem with the CUA and L&amp;R versions in <code>pom.xml</code> because they are using SNAPSHOT and it doesn&rsquo;t build</li> </ul></li> <li>I added the new CCAFS Phase II Project Tag <code>PII-FP1_PACCA2</code> and merged it into the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/379">#379</a>)</li> -<li>I proofed and tested the ILRI author corrections that Peter sent back to me this week:</li> -</ul> + +<li><p>I proofed and tested the ILRI author corrections that Peter sent back to me this week:</p> <pre><code>$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n -</code></pre> +</code></pre></li> -<ul> -<li>I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in <a href="https://alanorth.github.io/cgspace-notes/cgspace-notes/2018-03/">March, 2018</a></li> -<li>Time to index ~70,000 items on CGSpace:</li> -</ul> +<li><p>I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in <a href="https://alanorth.github.io/cgspace-notes/cgspace-notes/2018-03/">March, 2018</a></p></li> + +<li><p>Time to index ~70,000 items on CGSpace:</p> <pre><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b real 74m42.646s user 8m5.056s sys 2m7.289s -</code></pre> +</code></pre></li> +</ul> @@ -400,26 +402,25 @@ sys 2m7.289s <li>I didn&rsquo;t get any load alerts from Linode and the REST and XMLUI logs don&rsquo;t show anything out of the ordinary</li> <li>The nginx logs show HTTP 200s until <code>02/Jan/2018:11:27:17 +0000</code> when Uptime Robot got an HTTP 500</li> <li>In dspace.log around that time I see many errors like &ldquo;Client closed the connection before file download was complete&rdquo;</li> -<li>And just before that I see this:</li> -</ul> + +<li><p>And just before that I see this:</p> <pre><code>Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000]. -</code></pre> +</code></pre></li> -<ul> -<li>Ah hah! So the pool was actually empty!</li> -<li>I need to increase that, let&rsquo;s try to bump it up from 50 to 75</li> -<li>After that one client got an HTTP 499 but then the rest were HTTP 200, so I don&rsquo;t know what the hell Uptime Robot saw</li> -<li>I notice this error quite a few times in dspace.log:</li> -</ul> +<li><p>Ah hah! So the pool was actually empty!</p></li> + +<li><p>I need to increase that, let&rsquo;s try to bump it up from 50 to 75</p></li> + +<li><p>After that one client got an HTTP 499 but then the rest were HTTP 200, so I don&rsquo;t know what the hell Uptime Robot saw</p></li> + +<li><p>I notice this error quite a few times in dspace.log:</p> <pre><code>2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered &quot; &quot;]&quot; &quot;] &quot;&quot; at line 1, column 32. 
-</code></pre> +</code></pre></li> -<ul> -<li>And there are many of these errors every day for the past month:</li> -</ul> +<li><p>And there are many of these errors every day for the past month:</p> <pre><code>$ grep -c &quot;Error while searching for sidebar facets&quot; dspace.log.* dspace.log.2017-11-21:4 @@ -465,10 +466,9 @@ dspace.log.2017-12-30:89 dspace.log.2017-12-31:53 dspace.log.2018-01-01:45 dspace.log.2018-01-02:34 -</code></pre> +</code></pre></li> -<ul> -<li>Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let&rsquo;s Encrypt if it&rsquo;s just a handful of domains</li> +<li><p>Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let&rsquo;s Encrypt if it&rsquo;s just a handful of domains</p></li> </ul> @@ -503,20 +503,18 @@ dspace.log.2018-01-02:34 <h2 id="2017-11-02">2017-11-02</h2> <ul> -<li>Today there have been no hits by CORE and no alerts from Linode (coincidence?)</li> -</ul> +<li><p>Today there have been no hits by CORE and no alerts from Linode (coincidence?)</p> <pre><code># grep -c &quot;CORE&quot; /var/log/nginx/access.log 0 -</code></pre> +</code></pre></li> -<ul> -<li>Generate list of authors on CGSpace for Peter to go through and correct:</li> -</ul> +<li><p>Generate list of authors on CGSpace for Peter to go through and correct:</p> <pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv; COPY 54701 -</code></pre> +</code></pre></li> +</ul> @@ -528,15 +526,14 @@ COPY 54701 <h2 id="2017-10-01">2017-10-01</h2> <ul> -<li>Peter emailed to point out that many items in the <a href="https://cgspace.cgiar.org/handle/10568/2703">ILRI archive collection</a> have multiple handles:</li> -</ul> +<li><p>Peter emailed to point out that many items in the <a href="https://cgspace.cgiar.org/handle/10568/2703">ILRI archive collection</a> have multiple handles:</p> <pre><code>http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336 -</code></pre> +</code></pre></li> -<ul> -<li>There appears to be a pattern but I&rsquo;ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine</li> -<li>Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections</li> +<li><p>There appears to be a pattern but I&rsquo;ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine</p></li> + +<li><p>Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections</p></li> </ul> @@ -655,11 +652,12 @@ COPY 54701 <ul> <li>Remove redundant/duplicate text in the DSpace submission license</li> -<li>Testing the CMYK patch on a collection with 650 items:</li> -</ul> + +<li><p>Testing the CMYK patch on a collection with 650 items:</p> <pre><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &quot;ImageMagick PDF Thumbnail&quot; -v &gt;&amp; /tmp/filter-media-cmyk.txt -</code></pre> +</code></pre></li> +</ul> @@ -685,12 +683,13 @@ COPY 54701 <li>Also, need to consider talking to Atmire about hiring them to bring ORCiD metadata to REST / OAI</li> <li>Filed an issue on DSpace issue tracker for the <code>filter-media</code> bug that causes it to 
process JPGs even when limiting to the PDF thumbnail plugin: <a href="https://jira.duraspace.org/browse/DS-3516">DS-3516</a></li> <li>Discovered that the ImageMagic <code>filter-media</code> plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK</li> -<li>Interestingly, it seems DSpace 4.x&rsquo;s thumbnails were sRGB, but forcing regeneration using DSpace 5.x&rsquo;s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see <a href="https://cgspace.cgiar.org/handle/10568/51999"><sup>10568</sup>&frasl;<sub>51999</sub></a>):</li> -</ul> + +<li><p>Interestingly, it seems DSpace 4.x&rsquo;s thumbnails were sRGB, but forcing regeneration using DSpace 5.x&rsquo;s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see <a href="https://cgspace.cgiar.org/handle/10568/51999"><sup>10568</sup>&frasl;<sub>51999</sub></a>):</p> <pre><code>$ identify ~/Desktop/alc_contrastes_desafios.jpg /Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000 -</code></pre> +</code></pre></li> +</ul> @@ -702,23 +701,22 @@ COPY 54701 <h2 id="2017-02-07">2017-02-07</h2> <ul> -<li>An item was mapped twice erroneously again, so I had to remove one of the mappings manually:</li> -</ul> +<li><p>An item was mapped twice erroneously again, so I had to remove one of the mappings manually:</p> <pre><code>dspace=# select * from collection2item where item_id = '80278'; - id | collection_id | item_id +id | collection_id | item_id -------+---------------+--------- - 92551 | 313 | 80278 - 92550 | 313 | 80278 - 90774 | 1051 | 80278 +92551 | 313 | 80278 +92550 | 313 | 80278 +90774 | 1051 | 80278 (3 rows) dspace=# delete from collection2item where id = 92551 and item_id = 80278; DELETE 1 -</code></pre> +</code></pre></li> -<ul> -<li>Create issue on GitHub to track the addition of CCAFS Phase II project tags (<a href="https://github.com/ilri/DSpace/issues/301">#301</a>)</li> -<li>Looks like we&rsquo;ll be using <code>cg.identifier.ccafsprojectpii</code> as the field name</li> +<li><p>Create issue on GitHub to track the addition of CCAFS Phase II project tags (<a href="https://github.com/ilri/DSpace/issues/301">#301</a>)</p></li> + +<li><p>Looks like we&rsquo;ll be using <code>cg.identifier.ccafsprojectpii</code> as the field name</p></li> </ul> @@ -747,20 +745,21 @@ DELETE 1 <ul> <li>CGSpace was down for five hours in the morning while I was sleeping</li> -<li>While looking in the logs for errors, I see tons of warnings about Atmire MQM:</li> -</ul> + +<li><p>While looking in the logs for errors, I see tons of warnings about Atmire MQM:</p> <pre><code>2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;) 2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail=&quot;dc.title&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;) 2016-12-02 03:00:32,353 WARN 
com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail=&quot;THUMBNAIL&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;) 2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail=&quot;-1&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;) 2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;) -</code></pre> +</code></pre></li> -<ul> -<li>I see thousands of them in the logs for the last few months, so it&rsquo;s not related to the DSpace 5.5 upgrade</li> -<li>I&rsquo;ve raised a ticket with Atmire to ask</li> -<li>Another worrying error from dspace.log is:</li> +<li><p>I see thousands of them in the logs for the last few months, so it&rsquo;s not related to the DSpace 5.5 upgrade</p></li> + +<li><p>I&rsquo;ve raised a ticket with Atmire to ask</p></li> + +<li><p>Another worrying error from dspace.log is:</p></li> </ul> @@ -795,11 +794,12 @@ DELETE 1 <li>ORCIDs only</li> <li>ORCIDs plus normal authors</li> </ul></li> -<li>I exported a random item&rsquo;s metadata as CSV, deleted <em>all columns</em> except id and collection, and made a new coloum called <code>ORCID:dc.contributor.author</code> with the following random ORCIDs from the ORCID registry:</li> -</ul> + +<li><p>I exported a random item&rsquo;s metadata as CSV, deleted <em>all columns</em> except id and collection, and made a new coloum called <code>ORCID:dc.contributor.author</code> with the following random ORCIDs from the ORCID registry:</p> <pre><code>0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X -</code></pre> +</code></pre></li> +</ul> @@ -814,11 +814,12 @@ DELETE 1 <li>Discuss helping CCAFS with some batch tagging of ORCID IDs for their authors</li> <li>Discuss how the migration of CGIAR&rsquo;s Active Directory to a flat structure will break our LDAP groups in DSpace</li> <li>We had been using <code>DC=ILRI</code> to determine whether a user was ILRI or not</li> -<li>It looks like we might be able to use OUs now, instead of DCs:</li> -</ul> + +<li><p>It looks like we might be able to use OUs now, instead of DCs:</p> <pre><code>$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &quot;dc=cgiarad,dc=org&quot; -D &quot;admigration1@cgiarad.org&quot; -W &quot;(sAMAccountName=admigration1)&quot; -</code></pre> +</code></pre></li> +</ul> @@ -835,13 +836,14 @@ DELETE 1 <li>Bootstrap is at 3.3.0 but upstream is at 3.3.7, and upgrading to anything beyond 3.3.1 breaks glyphicons and probably more</li> <li>bower stuff is a dead end, waste of time, too many issues</li> <li>Anything after Bootstrap 3.3.1 makes glyphicons disappear (HTTP 404 trying to access from incorrect path 
of <code>fonts</code>)</li> -<li>Start working on DSpace 5.1 → 5.5 port:</li> -</ul> + +<li><p>Start working on DSpace 5.1 → 5.5 port:</p> <pre><code>$ git checkout -b 55new 5_x-prod $ git reset --hard ilri/5_x-prod $ git rebase -i dspace-5.5 -</code></pre> +</code></pre></li> +</ul> @@ -854,19 +856,18 @@ $ git rebase -i dspace-5.5 <ul> <li>Add <code>dc.description.sponsorship</code> to Discovery sidebar facets and make investors clickable in item view (<a href="https://github.com/ilri/DSpace/issues/232">#232</a>)</li> -<li>I think this query should find and replace all authors that have &ldquo;,&rdquo; at the end of their names:</li> -</ul> + +<li><p>I think this query should find and replace all authors that have &ldquo;,&rdquo; at the end of their names:</p> <pre><code>dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$'; UPDATE 95 dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$'; - text_value +text_value ------------ (0 rows) -</code></pre> +</code></pre></li> -<ul> -<li>In this case the select query was showing 95 results before the update</li> +<li><p>In this case the select query was showing 95 results before the update</p></li> </ul> @@ -899,12 +900,13 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and <ul> <li>Since yesterday there have been 10,000 REST errors and the site has been unstable again</li> <li>I have blocked access to the API now</li> -<li>There are 3,000 IPs accessing the REST API in a 24-hour period!</li> -</ul> + +<li><p>There are 3,000 IPs accessing the REST API in a 24-hour period!</p> <pre><code># awk '{print $1}' /var/log/nginx/rest.log | uniq | wc -l 3168 -</code></pre> +</code></pre></li> +</ul> @@ -985,15 +987,15 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and <h2 id="2015-12-02">2015-12-02</h2> <ul> -<li>Replace <code>lzop</code> with <code>xz</code> in log compression cron jobs on DSpace Test—it uses less space:</li> -</ul> +<li><p>Replace <code>lzop</code> with <code>xz</code> in log compression cron jobs on DSpace Test—it uses less space:</p> <pre><code># cd /home/dspacetest.cgiar.org/log # ls -lh dspace.log.2015-11-18* -rw-rw-r-- 1 tomcat7 tomcat7 2.0M Nov 18 23:59 dspace.log.2015-11-18 -rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo -rw-rw-r-- 1 tomcat7 tomcat7 169K Nov 18 23:59 dspace.log.2015-11-18.xz -</code></pre> +</code></pre></li> +</ul> @@ -1007,12 +1009,13 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and <ul> <li>CGSpace went down</li> <li>Looks like DSpace exhausted its PostgreSQL connection pool</li> -<li>Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:</li> -</ul> + +<li><p>Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:</p> <pre><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace 78 -</code></pre> +</code></pre></li> +</ul> diff --git a/docs/page/2/index.html b/docs/page/2/index.html index 3c6c30aae..d9992b369 100644 --- a/docs/page/2/index.html +++ b/docs/page/2/index.html @@ -15,7 +15,7 @@ - + @@ -101,18 +101,16 @@

    2018-07-01

    +
  • I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:

    $ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
    -
    +
  • - +
  • During the mvn package stage on the 5.8 branch I kept getting issues with java running out of memory:

    There is insufficient memory for the Java Runtime Environment to continue.
    -
    +
  • + Read more → @@ -139,23 +137,23 @@
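
A minimal sketch of how that backup could be verified before the upgrade, assuming a throwaway database (here called dspacetest) is available to restore into; names and paths are illustrative:

    # list the table of contents of the custom-format dump without restoring anything
    pg_restore -l dspace-2018-07-01.backup | head
    # restore into a scratch database to prove the dump is actually usable
    pg_restore -U dspace -d dspacetest --clean --no-owner dspace-2018-07-01.backup
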
  • There seems to be a problem with the CUA and L&R versions in pom.xml because they are using SNAPSHOT and it doesn’t build
  • I added the new CCAFS Phase II Project Tag PII-FP1_PACCA2 and merged it into the 5_x-prod branch (#379)
  • -
  • I proofed and tested the ILRI author corrections that Peter sent back to me this week:
  • - + +
  • I proofed and tested the ILRI author corrections that Peter sent back to me this week:

    $ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
    -
    +
  • - +
  • I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in March, 2018

  • + +
  • Time to index ~70,000 items on CGSpace:

    $ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b                                  
     
     real    74m42.646s
     user    8m5.056s
     sys     2m7.289s
    -
    +
  • + Read more → @@ -279,26 +277,25 @@ sys 2m7.289s
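
For reference, the wrapper around that index-discovery run is just three layers of scheduling, shown again below with comments; as far as I understand the flags, schedtool -D selects the idle CPU policy, ionice -c2 -n7 is the lowest best-effort I/O priority, and nice -n19 is the lowest CPU niceness. The [dspace] path is a placeholder as in the original note.

    # run the reindex so it only uses CPU and I/O that nothing else wants
    time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
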
  • I didn’t get any load alerts from Linode and the REST and XMLUI logs don’t show anything out of the ordinary
  • The nginx logs show HTTP 200s until 02/Jan/2018:11:27:17 +0000 when Uptime Robot got an HTTP 500
  • In dspace.log around that time I see many errors like “Client closed the connection before file download was complete”
  • -
  • And just before that I see this:
  • - + +
  • And just before that I see this:

    Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
    -
    +
  • - +
  • Ah hah! So the pool was actually empty!

  • + +
  • I need to increase that, let’s try to bump it up from 50 to 75

  • + +
  • After that one client got an HTTP 499 but then the rest were HTTP 200, so I don’t know what the hell Uptime Robot saw

  • + +
  • I notice this error quite a few times in dspace.log:

    2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
     org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered " "]" "] "" at line 1, column 32.
    -
    +
  • - +
  • And there are many of these errors every day for the past month:

    $ grep -c "Error while searching for sidebar facets" dspace.log.*
     dspace.log.2017-11-21:4
    @@ -344,10 +341,9 @@ dspace.log.2017-12-30:89
     dspace.log.2017-12-31:53
     dspace.log.2018-01-01:45
     dspace.log.2018-01-02:34
    -
    +
  • - Read more → @@ -400,20 +396,18 @@ dspace.log.2018-01-02:34
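
To see whether bumping the pool from 50 to 75 actually helps, a per-day count of that exception in the DSpace logs would make the before/after comparison easy; a minimal sketch, assuming the daily dspace.log files are in the current directory:

    # occurrences of the pool exhaustion error per daily log, sorted by count
    grep -c 'PoolExhaustedException' dspace.log.2018-01-* | sort -t: -k2 -n
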

    2017-11-02

    +
  • Today there have been no hits by CORE and no alerts from Linode (coincidence?)

    # grep -c "CORE" /var/log/nginx/access.log
     0
    -
    +
  • - +
  • Generate list of authors on CGSpace for Peter to go through and correct:

    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
     COPY 54701
    -
    +
  • + Read more → @@ -434,15 +428,14 @@ COPY 54701
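
A quick way to double-check which clients are hitting the site when CORE goes quiet is to break the log down by user agent. This assumes nginx's default combined log format, where the user agent is the sixth double-quoted field; the field number would need adjusting for a custom format:

    # top user agents in the current access log (combined log format assumed)
    awk -F'"' '{print $6}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head
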

    2017-10-01

    +
  • Peter emailed to point out that many items in the ILRI archive collection have multiple handles:

    http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
    -
    +
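
A sketch of how the affected items could be listed in SQL before cleaning them up, assuming the handles are stored as separate dc.identifier.uri values on each item (the double pipes in the example above look like the CSV export's separator rather than what is stored); if another schema also defines identifier.uri the subquery would need a schema filter:

    dspace=# SELECT resource_id, COUNT(*) FROM metadatavalue
             WHERE metadata_field_id = (SELECT metadata_field_id FROM metadatafieldregistry
                                        WHERE element = 'identifier' AND qualifier = 'uri')
               AND resource_type_id = 2
             GROUP BY resource_id HAVING COUNT(*) > 1;
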
  • - Read more → diff --git a/docs/page/3/index.html b/docs/page/3/index.html index 65f0232c2..06f362431 100644 --- a/docs/page/3/index.html +++ b/docs/page/3/index.html @@ -15,7 +15,7 @@ - + @@ -261,11 +261,12 @@ + +
  • Testing the CMYK patch on a collection with 650 items:

    $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt
    -
    +
  • + Read more → @@ -300,12 +301,13 @@
  • Also, need to consider talking to Atmire about hiring them to bring ORCiD metadata to REST / OAI
  • Filed an issue on DSpace issue tracker for the filter-media bug that causes it to process JPGs even when limiting to the PDF thumbnail plugin: DS-3516
  • Discovered that the ImageMagick filter-media plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK
  • -
  • Interestingly, it seems DSpace 4.x’s thumbnails were sRGB, but forcing regeneration using DSpace 5.x’s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 1056851999):
  • - + +
  • Interestingly, it seems DSpace 4.x’s thumbnails were sRGB, but forcing regeneration using DSpace 5.x’s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 1056851999):

    $ identify ~/Desktop/alc_contrastes_desafios.jpg
     /Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
    -
    +
  • + Read more → @@ -326,23 +328,22 @@
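
For checking or fixing individual files, ImageMagick can also report and convert the colorspace directly; a small sketch with the same example filename:

    # print just the colorspace of the generated thumbnail
    identify -format '%[colorspace]\n' alc_contrastes_desafios.jpg
    # write an sRGB copy alongside the CMYK original
    convert alc_contrastes_desafios.jpg -colorspace sRGB alc_contrastes_desafios-srgb.jpg
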

    2017-02-07

    +
  • An item was mapped twice erroneously again, so I had to remove one of the mappings manually:

    dspace=# select * from collection2item where item_id = '80278';
    -  id   | collection_id | item_id
    +id   | collection_id | item_id
     -------+---------------+---------
    - 92551 |           313 |   80278
    - 92550 |           313 |   80278
    - 90774 |          1051 |   80278
    +92551 |           313 |   80278
    +92550 |           313 |   80278
    +90774 |          1051 |   80278
     (3 rows)
     dspace=# delete from collection2item where id = 92551 and item_id = 80278;
     DELETE 1
    -
    +
  • - Read more → diff --git a/docs/page/4/index.html b/docs/page/4/index.html index 7d6b7cdf3..901c4e868 100644 --- a/docs/page/4/index.html +++ b/docs/page/4/index.html @@ -15,7 +15,7 @@ - + @@ -102,20 +102,21 @@ + +
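
Instead of noticing these one at a time, a grouped query over collection2item should list every mapping that exists more than once; a sketch based on the columns shown above:

    dspace=# SELECT item_id, collection_id, COUNT(*)
             FROM collection2item
             GROUP BY item_id, collection_id
             HAVING COUNT(*) > 1;
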
  • While looking in the logs for errors, I see tons of warnings about Atmire MQM:

    2016-12-02 03:00:32,352 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail="dc.title", transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail="THUMBNAIL", transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
    -
    +
  • - Read more → @@ -168,11 +169,12 @@
  • ORCIDs only
  • ORCIDs plus normal authors
  • -
  • I exported a random item’s metadata as CSV, deleted all columns except id and collection, and made a new column called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:
  • - + +
  • I exported a random item’s metadata as CSV, deleted all columns except id and collection, and made a new column called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:

    0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
    -
    +
  • + Read more → @@ -196,11 +198,12 @@
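
To make that test concrete, the CSV would look roughly like the sketch below and could then be loaded with DSpace's CSV metadata import; the item id, collection handle, and e-person address are placeholders, and I am assuming the stock metadata-import options here:

    $ cat /tmp/orcid-test.csv
    id,collection,ORCID:dc.contributor.author
    12345,10568/xxxxx,0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
    $ [dspace]/bin/dspace metadata-import -f /tmp/orcid-test.csv -e user@example.org
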
  • Discuss helping CCAFS with some batch tagging of ORCID IDs for their authors
  • Discuss how the migration of CGIAR’s Active Directory to a flat structure will break our LDAP groups in DSpace
  • We had been using DC=ILRI to determine whether a user was ILRI or not
  • -
  • It looks like we might be able to use OUs now, instead of DCs:
  • - + +
  • It looks like we might be able to use OUs now, instead of DCs:

    $ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=org" -D "admigration1@cgiarad.org" -W "(sAMAccountName=admigration1)"
    -
    +
  • + Read more → @@ -226,13 +229,14 @@
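
If OUs do work out, the same search can simply be scoped to an OU-based search base; the OU below is purely illustrative since I do not yet know how the flattened directory will be organized:

    $ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "ou=ILRI,dc=cgiarad,dc=org" -D "admigration1@cgiarad.org" -W "(sAMAccountName=admigration1)"
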
  • Bootstrap is at 3.3.0 but upstream is at 3.3.7, and upgrading to anything beyond 3.3.1 breaks glyphicons and probably more
  • bower stuff is a dead end, waste of time, too many issues
  • Anything after Bootstrap 3.3.1 makes glyphicons disappear (HTTP 404 trying to access from incorrect path of fonts)
  • -
  • Start working on DSpace 5.1 → 5.5 port:
  • - + +
  • Start working on DSpace 5.1 → 5.5 port:

    $ git checkout -b 55new 5_x-prod
     $ git reset --hard ilri/5_x-prod
     $ git rebase -i dspace-5.5
    -
    +
  • + Read more → @@ -254,19 +258,18 @@ $ git rebase -i dspace-5.5 + +
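
After a rebase like that it is worth a quick sanity check of what actually landed on the new branch before building; these are plain git commands, nothing CGSpace specific:

    # local commits now carried on top of dspace-5.5
    git log --oneline dspace-5.5..55new
    # files touched by those local patches
    git diff --stat dspace-5.5..55new
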
  • I think this query should find and replace all authors that have “,” at the end of their names:

    dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
     UPDATE 95
     dspacetest=# select text_value from  metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
    - text_value
    +text_value
     ------------
     (0 rows)
    -
    +
  • - Read more → @@ -317,12 +320,13 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and + +
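
One way to make an update like that a bit safer is to wrap it in a transaction, check the count, and only commit once it comes back as zero; a minimal sketch of the pattern:

    dspacetest=# BEGIN;
    dspacetest=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '(^.+?),$', '\1')
                 WHERE metadata_field_id=3 AND resource_type_id=2 AND text_value ~ '^.+?,$';
    dspacetest=# SELECT COUNT(*) FROM metadatavalue
                 WHERE metadata_field_id=3 AND resource_type_id=2 AND text_value ~ '^.+?,$';
    dspacetest=# ROLLBACK; -- or COMMIT; once the count is 0
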
  • There are 3,000 IPs accessing the REST API in a 24-hour period!

    # awk '{print $1}' /var/log/nginx/rest.log  | uniq | wc -l
     3168
    -
    +
  • + Read more → diff --git a/docs/page/5/index.html b/docs/page/5/index.html index 396a5742a..e7b5dbc67 100644 --- a/docs/page/5/index.html +++ b/docs/page/5/index.html @@ -15,7 +15,7 @@ - + @@ -156,15 +156,15 @@
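
One caveat about that count: uniq only collapses adjacent duplicates, so on an unsorted log it overstates the number of distinct IPs; sorting first gives the real figure:

    # distinct client IPs hitting the REST API in the current log
    awk '{print $1}' /var/log/nginx/rest.log | sort -u | wc -l
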

    2015-12-02

    +
  • Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less space:

    # cd /home/dspacetest.cgiar.org/log
     # ls -lh dspace.log.2015-11-18*
     -rw-rw-r-- 1 tomcat7 tomcat7 2.0M Nov 18 23:59 dspace.log.2015-11-18
     -rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo
     -rw-rw-r-- 1 tomcat7 tomcat7 169K Nov 18 23:59 dspace.log.2015-11-18.xz
    -
    +
  • + Read more → @@ -187,12 +187,13 @@ + +
  • Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:

    $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
     78
    -
    +
  • + Read more → diff --git a/docs/posts/index.html b/docs/posts/index.html index f70f37d06..ecbcc8c19 100644 --- a/docs/posts/index.html +++ b/docs/posts/index.html @@ -15,7 +15,7 @@ - + @@ -108,15 +108,14 @@
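
A slightly tidier way to watch those connections is to let PostgreSQL group them itself; this assumes a PostgreSQL version new enough to have the state column in pg_stat_activity (9.2 or later):

    $ psql -c 'SELECT datname, state, COUNT(*) FROM pg_stat_activity GROUP BY datname, state ORDER BY COUNT(*) DESC;'
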
  • Apparently if the item is in the workflowitem table it is submitted to a workflow
  • And if it is in the workspaceitem table it is in the pre-submitted state
  • -
  • The item seems to be in a pre-submitted state, so I tried to delete it from there:
  • - + +
  • The item seems to be in a pre-submitted state, so I tried to delete it from there:

    dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
     DELETE 1
    -
    +
  • - Read more → @@ -143,27 +142,27 @@ DELETE 1 -
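
Before deleting anything it is worth checking both tables mentioned above to confirm which state the stuck item is really in; a sketch using the same item id:

    dspace=# SELECT * FROM workflowitem WHERE item_id = 74648;
    dspace=# SELECT * FROM workspaceitem WHERE item_id = 74648;
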
  • There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today + +
  • There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today

  • - +
  • I suspected that some might not be successful, because the stats show less, but today they were all HTTP 200!

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
    -   4432 200
    -
    +4432 200 +
  • + - +
  • In the last two weeks there have been 47,000 downloads of this same exact PDF by these three IP addresses

  • + +
  • Apply country and region corrections and deletions on DSpace Test and CGSpace:

    $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
     $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
     $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
     $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
    -
    +
  • + Read more → @@ -218,27 +217,27 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace + +
  • The top IPs before, during, and after this latest alert tonight were:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "01/Feb/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -    245 207.46.13.5
    -    332 54.70.40.11
    -    385 5.143.231.38
    -    405 207.46.13.173
    -    405 207.46.13.75
    -   1117 66.249.66.219
    -   1121 35.237.175.180
    -   1546 5.9.6.51
    -   2474 45.5.186.2
    -   5490 85.25.237.71
    -
    +245 207.46.13.5 +332 54.70.40.11 +385 5.143.231.38 +405 207.46.13.173 +405 207.46.13.75 +1117 66.249.66.219 +1121 35.237.175.180 +1546 5.9.6.51 +2474 45.5.186.2 +5490 85.25.237.71 +
  • - +
  • 85.25.237.71 is the “Linguee Bot” that I first saw last month

  • + +
  • The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase

  • + +
  • There were just over 3 million accesses in the nginx logs last month:

    # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
     3018243
    @@ -246,7 +245,8 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
     real    0m19.873s
     user    0m22.203s
     sys     0m1.979s
    -
    +
  • + Read more → @@ -268,21 +268,22 @@ sys 0m1.979s + +
  • I don’t see anything interesting in the web server logs around that time though:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -     92 40.77.167.4
    -     99 210.7.29.100
    -    120 38.126.157.45
    -    177 35.237.175.180
    -    177 40.77.167.32
    -    216 66.249.75.219
    -    225 18.203.76.93
    -    261 46.101.86.248
    -    357 207.46.13.1
    -    903 54.70.40.11
    -
    + 92 40.77.167.4 + 99 210.7.29.100 +120 38.126.157.45 +177 35.237.175.180 +177 40.77.167.32 +216 66.249.75.219 +225 18.203.76.93 +261 46.101.86.248 +357 207.46.13.1 +903 54.70.40.11 +
  • + Read more → @@ -411,21 +412,24 @@ sys 0m1.979s

    2018-08-01

    +
  • DSpace Test had crashed at some point yesterday morning and I see the following in dmesg:

    [Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
     [Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB
     [Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    -
    +
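
If I do raise the heap from 5120m to 6144m, the change itself is just the -Xmx/-Xms values in whatever sets Tomcat's JAVA_OPTS on that host; the file below is only an example location, since the exact place depends on how Tomcat is installed:

    # e.g. in Tomcat's bin/setenv.sh (location is an assumption)
    JAVA_OPTS="$JAVA_OPTS -Xmx6144m -Xms6144m"
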
  • - Read more → diff --git a/docs/posts/index.xml b/docs/posts/index.xml index daa0b0309..5563865bc 100644 --- a/docs/posts/index.xml +++ b/docs/posts/index.xml @@ -27,15 +27,14 @@ <li>Apparently if the item is in the <code>workflowitem</code> table it is submitted to a workflow</li> <li>And if it is in the <code>workspaceitem</code> table it is in the pre-submitted state</li> </ul></li> -<li>The item seems to be in a pre-submitted state, so I tried to delete it from there:</li> -</ul> + +<li><p>The item seems to be in a pre-submitted state, so I tried to delete it from there:</p> <pre><code>dspace=# DELETE FROM workspaceitem WHERE item_id=74648; DELETE 1 -</code></pre> +</code></pre></li> -<ul> -<li>But after this I tried to delete the item from the XMLUI and it is <em>still</em> present&hellip;</li> +<li><p>But after this I tried to delete the item from the XMLUI and it is <em>still</em> present&hellip;</p></li> </ul> @@ -53,27 +52,27 @@ DELETE 1 <ul> <li>They asked if we had plans to enable RDF support in CGSpace</li> </ul></li> -<li>There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today + +<li><p>There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today</p> <ul> -<li>I suspected that some might not be successful, because the stats show less, but today they were all HTTP 200!</li> -</ul></li> -</ul> +<li><p>I suspected that some might not be successful, because the stats show less, but today they were all HTTP 200!</p> <pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5 - 4432 200 -</code></pre> +4432 200 +</code></pre></li> +</ul></li> -<ul> -<li>In the last two weeks there have been 47,000 downloads of this <em>same exact PDF</em> by these three IP addresses</li> -<li>Apply country and region corrections and deletions on DSpace Test and CGSpace:</li> -</ul> +<li><p>In the last two weeks there have been 47,000 downloads of this <em>same exact PDF</em> by these three IP addresses</p></li> + +<li><p>Apply country and region corrections and deletions on DSpace Test and CGSpace:</p> <pre><code>$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d -</code></pre> +</code></pre></li> +</ul> @@ -110,27 +109,27 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace <ul> <li>Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!</li> -<li>The top IPs before, during, and after this latest alert tonight were:</li> -</ul> + +<li><p>The top IPs before, during, and after this latest alert tonight were:</p> <pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;01/Feb/2019:(17|18|19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail 
-n 10 - 245 207.46.13.5 - 332 54.70.40.11 - 385 5.143.231.38 - 405 207.46.13.173 - 405 207.46.13.75 - 1117 66.249.66.219 - 1121 35.237.175.180 - 1546 5.9.6.51 - 2474 45.5.186.2 - 5490 85.25.237.71 -</code></pre> +245 207.46.13.5 +332 54.70.40.11 +385 5.143.231.38 +405 207.46.13.173 +405 207.46.13.75 +1117 66.249.66.219 +1121 35.237.175.180 +1546 5.9.6.51 +2474 45.5.186.2 +5490 85.25.237.71 +</code></pre></li> -<ul> -<li><code>85.25.237.71</code> is the &ldquo;Linguee Bot&rdquo; that I first saw last month</li> -<li>The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase</li> -<li>There were just over 3 million accesses in the nginx logs last month:</li> -</ul> +<li><p><code>85.25.237.71</code> is the &ldquo;Linguee Bot&rdquo; that I first saw last month</p></li> + +<li><p>The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase</p></li> + +<li><p>There were just over 3 million accesses in the nginx logs last month:</p> <pre><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2019&quot; 3018243 @@ -138,7 +137,8 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace real 0m19.873s user 0m22.203s sys 0m1.979s -</code></pre> +</code></pre></li> +</ul> @@ -151,21 +151,22 @@ sys 0m1.979s <ul> <li>Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning</li> -<li>I don&rsquo;t see anything interesting in the web server logs around that time though:</li> -</ul> + +<li><p>I don&rsquo;t see anything interesting in the web server logs around that time though:</p> <pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 - 92 40.77.167.4 - 99 210.7.29.100 - 120 38.126.157.45 - 177 35.237.175.180 - 177 40.77.167.32 - 216 66.249.75.219 - 225 18.203.76.93 - 261 46.101.86.248 - 357 207.46.13.1 - 903 54.70.40.11 -</code></pre> + 92 40.77.167.4 + 99 210.7.29.100 +120 38.126.157.45 +177 35.237.175.180 +177 40.77.167.32 +216 66.249.75.219 +225 18.203.76.93 +261 46.101.86.248 +357 207.46.13.1 +903 54.70.40.11 +</code></pre></li> +</ul> @@ -249,21 +250,24 @@ sys 0m1.979s <h2 id="2018-08-01">2018-08-01</h2> <ul> -<li>DSpace Test had crashed at some point yesterday morning and I see the following in <code>dmesg</code>:</li> -</ul> +<li><p>DSpace Test had crashed at some point yesterday morning and I see the following in <code>dmesg</code>:</p> <pre><code>[Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child [Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB [Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB -</code></pre> +</code></pre></li> -<ul> -<li>Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight</li> -<li>From the DSpace log I see that eventually Solr stopped responding, so I guess the <code>java</code> process that was OOM killed above was Tomcat&rsquo;s</li> -<li>I&rsquo;m not sure why Tomcat didn&rsquo;t crash with an OutOfMemoryError&hellip;</li> -<li>Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core</li> -<li>The server only has 
8GB of RAM so we&rsquo;ll eventually need to upgrade to a larger one because we&rsquo;ll start starving the OS, PostgreSQL, and command line batch processes</li> -<li>I ran all system updates on DSpace Test and rebooted it</li> +<li><p>Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight</p></li> + +<li><p>From the DSpace log I see that eventually Solr stopped responding, so I guess the <code>java</code> process that was OOM killed above was Tomcat&rsquo;s</p></li> + +<li><p>I&rsquo;m not sure why Tomcat didn&rsquo;t crash with an OutOfMemoryError&hellip;</p></li> + +<li><p>Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core</p></li> + +<li><p>The server only has 8GB of RAM so we&rsquo;ll eventually need to upgrade to a larger one because we&rsquo;ll start starving the OS, PostgreSQL, and command line batch processes</p></li> + +<li><p>I ran all system updates on DSpace Test and rebooted it</p></li> </ul> @@ -276,18 +280,16 @@ sys 0m1.979s <h2 id="2018-07-01">2018-07-01</h2> <ul> -<li>I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:</li> -</ul> +<li><p>I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:</p> <pre><code>$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace -</code></pre> +</code></pre></li> -<ul> -<li>During the <code>mvn package</code> stage on the 5.8 branch I kept getting issues with java running out of memory:</li> -</ul> +<li><p>During the <code>mvn package</code> stage on the 5.8 branch I kept getting issues with java running out of memory:</p> <pre><code>There is insufficient memory for the Java Runtime Environment to continue. 
-</code></pre> +</code></pre></li> +</ul> @@ -305,23 +307,23 @@ sys 0m1.979s <li>There seems to be a problem with the CUA and L&amp;R versions in <code>pom.xml</code> because they are using SNAPSHOT and it doesn&rsquo;t build</li> </ul></li> <li>I added the new CCAFS Phase II Project Tag <code>PII-FP1_PACCA2</code> and merged it into the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/379">#379</a>)</li> -<li>I proofed and tested the ILRI author corrections that Peter sent back to me this week:</li> -</ul> + +<li><p>I proofed and tested the ILRI author corrections that Peter sent back to me this week:</p> <pre><code>$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n -</code></pre> +</code></pre></li> -<ul> -<li>I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in <a href="https://alanorth.github.io/cgspace-notes/cgspace-notes/2018-03/">March, 2018</a></li> -<li>Time to index ~70,000 items on CGSpace:</li> -</ul> +<li><p>I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in <a href="https://alanorth.github.io/cgspace-notes/cgspace-notes/2018-03/">March, 2018</a></p></li> + +<li><p>Time to index ~70,000 items on CGSpace:</p> <pre><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b real 74m42.646s user 8m5.056s sys 2m7.289s -</code></pre> +</code></pre></li> +</ul> @@ -400,26 +402,25 @@ sys 2m7.289s <li>I didn&rsquo;t get any load alerts from Linode and the REST and XMLUI logs don&rsquo;t show anything out of the ordinary</li> <li>The nginx logs show HTTP 200s until <code>02/Jan/2018:11:27:17 +0000</code> when Uptime Robot got an HTTP 500</li> <li>In dspace.log around that time I see many errors like &ldquo;Client closed the connection before file download was complete&rdquo;</li> -<li>And just before that I see this:</li> -</ul> + +<li><p>And just before that I see this:</p> <pre><code>Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000]. -</code></pre> +</code></pre></li> -<ul> -<li>Ah hah! So the pool was actually empty!</li> -<li>I need to increase that, let&rsquo;s try to bump it up from 50 to 75</li> -<li>After that one client got an HTTP 499 but then the rest were HTTP 200, so I don&rsquo;t know what the hell Uptime Robot saw</li> -<li>I notice this error quite a few times in dspace.log:</li> -</ul> +<li><p>Ah hah! So the pool was actually empty!</p></li> + +<li><p>I need to increase that, let&rsquo;s try to bump it up from 50 to 75</p></li> + +<li><p>After that one client got an HTTP 499 but then the rest were HTTP 200, so I don&rsquo;t know what the hell Uptime Robot saw</p></li> + +<li><p>I notice this error quite a few times in dspace.log:</p> <pre><code>2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered &quot; &quot;]&quot; &quot;] &quot;&quot; at line 1, column 32. 
-</code></pre> +</code></pre></li> -<ul> -<li>And there are many of these errors every day for the past month:</li> -</ul> +<li><p>And there are many of these errors every day for the past month:</p> <pre><code>$ grep -c &quot;Error while searching for sidebar facets&quot; dspace.log.* dspace.log.2017-11-21:4 @@ -465,10 +466,9 @@ dspace.log.2017-12-30:89 dspace.log.2017-12-31:53 dspace.log.2018-01-01:45 dspace.log.2018-01-02:34 -</code></pre> +</code></pre></li> -<ul> -<li>Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let&rsquo;s Encrypt if it&rsquo;s just a handful of domains</li> +<li><p>Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let&rsquo;s Encrypt if it&rsquo;s just a handful of domains</p></li> </ul> @@ -503,20 +503,18 @@ dspace.log.2018-01-02:34 <h2 id="2017-11-02">2017-11-02</h2> <ul> -<li>Today there have been no hits by CORE and no alerts from Linode (coincidence?)</li> -</ul> +<li><p>Today there have been no hits by CORE and no alerts from Linode (coincidence?)</p> <pre><code># grep -c &quot;CORE&quot; /var/log/nginx/access.log 0 -</code></pre> +</code></pre></li> -<ul> -<li>Generate list of authors on CGSpace for Peter to go through and correct:</li> -</ul> +<li><p>Generate list of authors on CGSpace for Peter to go through and correct:</p> <pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv; COPY 54701 -</code></pre> +</code></pre></li> +</ul> @@ -528,15 +526,14 @@ COPY 54701 <h2 id="2017-10-01">2017-10-01</h2> <ul> -<li>Peter emailed to point out that many items in the <a href="https://cgspace.cgiar.org/handle/10568/2703">ILRI archive collection</a> have multiple handles:</li> -</ul> +<li><p>Peter emailed to point out that many items in the <a href="https://cgspace.cgiar.org/handle/10568/2703">ILRI archive collection</a> have multiple handles:</p> <pre><code>http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336 -</code></pre> +</code></pre></li> -<ul> -<li>There appears to be a pattern but I&rsquo;ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine</li> -<li>Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections</li> +<li><p>There appears to be a pattern but I&rsquo;ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine</p></li> + +<li><p>Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections</p></li> </ul> @@ -655,11 +652,12 @@ COPY 54701 <ul> <li>Remove redundant/duplicate text in the DSpace submission license</li> -<li>Testing the CMYK patch on a collection with 650 items:</li> -</ul> + +<li><p>Testing the CMYK patch on a collection with 650 items:</p> <pre><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &quot;ImageMagick PDF Thumbnail&quot; -v &gt;&amp; /tmp/filter-media-cmyk.txt -</code></pre> +</code></pre></li> +</ul> @@ -685,12 +683,13 @@ COPY 54701 <li>Also, need to consider talking to Atmire about hiring them to bring ORCiD metadata to REST / OAI</li> <li>Filed an issue on DSpace issue tracker for the <code>filter-media</code> bug that causes it to 
process JPGs even when limiting to the PDF thumbnail plugin: <a href="https://jira.duraspace.org/browse/DS-3516">DS-3516</a></li> <li>Discovered that the ImageMagic <code>filter-media</code> plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK</li> -<li>Interestingly, it seems DSpace 4.x&rsquo;s thumbnails were sRGB, but forcing regeneration using DSpace 5.x&rsquo;s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see <a href="https://cgspace.cgiar.org/handle/10568/51999"><sup>10568</sup>&frasl;<sub>51999</sub></a>):</li> -</ul> + +<li><p>Interestingly, it seems DSpace 4.x&rsquo;s thumbnails were sRGB, but forcing regeneration using DSpace 5.x&rsquo;s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see <a href="https://cgspace.cgiar.org/handle/10568/51999"><sup>10568</sup>&frasl;<sub>51999</sub></a>):</p> <pre><code>$ identify ~/Desktop/alc_contrastes_desafios.jpg /Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000 -</code></pre> +</code></pre></li> +</ul> @@ -702,23 +701,22 @@ COPY 54701 <h2 id="2017-02-07">2017-02-07</h2> <ul> -<li>An item was mapped twice erroneously again, so I had to remove one of the mappings manually:</li> -</ul> +<li><p>An item was mapped twice erroneously again, so I had to remove one of the mappings manually:</p> <pre><code>dspace=# select * from collection2item where item_id = '80278'; - id | collection_id | item_id +id | collection_id | item_id -------+---------------+--------- - 92551 | 313 | 80278 - 92550 | 313 | 80278 - 90774 | 1051 | 80278 +92551 | 313 | 80278 +92550 | 313 | 80278 +90774 | 1051 | 80278 (3 rows) dspace=# delete from collection2item where id = 92551 and item_id = 80278; DELETE 1 -</code></pre> +</code></pre></li> -<ul> -<li>Create issue on GitHub to track the addition of CCAFS Phase II project tags (<a href="https://github.com/ilri/DSpace/issues/301">#301</a>)</li> -<li>Looks like we&rsquo;ll be using <code>cg.identifier.ccafsprojectpii</code> as the field name</li> +<li><p>Create issue on GitHub to track the addition of CCAFS Phase II project tags (<a href="https://github.com/ilri/DSpace/issues/301">#301</a>)</p></li> + +<li><p>Looks like we&rsquo;ll be using <code>cg.identifier.ccafsprojectpii</code> as the field name</p></li> </ul> @@ -747,20 +745,21 @@ DELETE 1 <ul> <li>CGSpace was down for five hours in the morning while I was sleeping</li> -<li>While looking in the logs for errors, I see tons of warnings about Atmire MQM:</li> -</ul> + +<li><p>While looking in the logs for errors, I see tons of warnings about Atmire MQM:</p> <pre><code>2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;) 2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail=&quot;dc.title&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;) 2016-12-02 03:00:32,353 WARN 
com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail=&quot;THUMBNAIL&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;) 2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail=&quot;-1&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;) 2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;) -</code></pre> +</code></pre></li> -<ul> -<li>I see thousands of them in the logs for the last few months, so it&rsquo;s not related to the DSpace 5.5 upgrade</li> -<li>I&rsquo;ve raised a ticket with Atmire to ask</li> -<li>Another worrying error from dspace.log is:</li> +<li><p>I see thousands of them in the logs for the last few months, so it&rsquo;s not related to the DSpace 5.5 upgrade</p></li> + +<li><p>I&rsquo;ve raised a ticket with Atmire to ask</p></li> + +<li><p>Another worrying error from dspace.log is:</p></li> </ul> @@ -795,11 +794,12 @@ DELETE 1 <li>ORCIDs only</li> <li>ORCIDs plus normal authors</li> </ul></li> -<li>I exported a random item&rsquo;s metadata as CSV, deleted <em>all columns</em> except id and collection, and made a new coloum called <code>ORCID:dc.contributor.author</code> with the following random ORCIDs from the ORCID registry:</li> -</ul> + +<li><p>I exported a random item&rsquo;s metadata as CSV, deleted <em>all columns</em> except id and collection, and made a new coloum called <code>ORCID:dc.contributor.author</code> with the following random ORCIDs from the ORCID registry:</p> <pre><code>0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X -</code></pre> +</code></pre></li> +</ul> @@ -814,11 +814,12 @@ DELETE 1 <li>Discuss helping CCAFS with some batch tagging of ORCID IDs for their authors</li> <li>Discuss how the migration of CGIAR&rsquo;s Active Directory to a flat structure will break our LDAP groups in DSpace</li> <li>We had been using <code>DC=ILRI</code> to determine whether a user was ILRI or not</li> -<li>It looks like we might be able to use OUs now, instead of DCs:</li> -</ul> + +<li><p>It looks like we might be able to use OUs now, instead of DCs:</p> <pre><code>$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &quot;dc=cgiarad,dc=org&quot; -D &quot;admigration1@cgiarad.org&quot; -W &quot;(sAMAccountName=admigration1)&quot; -</code></pre> +</code></pre></li> +</ul> @@ -835,13 +836,14 @@ DELETE 1 <li>Bootstrap is at 3.3.0 but upstream is at 3.3.7, and upgrading to anything beyond 3.3.1 breaks glyphicons and probably more</li> <li>bower stuff is a dead end, waste of time, too many issues</li> <li>Anything after Bootstrap 3.3.1 makes glyphicons disappear (HTTP 404 trying to access from incorrect path 
of <code>fonts</code>)</li> -<li>Start working on DSpace 5.1 → 5.5 port:</li> -</ul> + +<li><p>Start working on DSpace 5.1 → 5.5 port:</p> <pre><code>$ git checkout -b 55new 5_x-prod $ git reset --hard ilri/5_x-prod $ git rebase -i dspace-5.5 -</code></pre> +</code></pre></li> +</ul> @@ -854,19 +856,18 @@ $ git rebase -i dspace-5.5 <ul> <li>Add <code>dc.description.sponsorship</code> to Discovery sidebar facets and make investors clickable in item view (<a href="https://github.com/ilri/DSpace/issues/232">#232</a>)</li> -<li>I think this query should find and replace all authors that have &ldquo;,&rdquo; at the end of their names:</li> -</ul> + +<li><p>I think this query should find and replace all authors that have &ldquo;,&rdquo; at the end of their names:</p> <pre><code>dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$'; UPDATE 95 dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$'; - text_value +text_value ------------ (0 rows) -</code></pre> +</code></pre></li> -<ul> -<li>In this case the select query was showing 95 results before the update</li> +<li><p>In this case the select query was showing 95 results before the update</p></li> </ul> @@ -899,12 +900,13 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and <ul> <li>Since yesterday there have been 10,000 REST errors and the site has been unstable again</li> <li>I have blocked access to the API now</li> -<li>There are 3,000 IPs accessing the REST API in a 24-hour period!</li> -</ul> + +<li><p>There are 3,000 IPs accessing the REST API in a 24-hour period!</p> <pre><code># awk '{print $1}' /var/log/nginx/rest.log | uniq | wc -l 3168 -</code></pre> +</code></pre></li> +</ul> @@ -985,15 +987,15 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and <h2 id="2015-12-02">2015-12-02</h2> <ul> -<li>Replace <code>lzop</code> with <code>xz</code> in log compression cron jobs on DSpace Test—it uses less space:</li> -</ul> +<li><p>Replace <code>lzop</code> with <code>xz</code> in log compression cron jobs on DSpace Test—it uses less space:</p> <pre><code># cd /home/dspacetest.cgiar.org/log # ls -lh dspace.log.2015-11-18* -rw-rw-r-- 1 tomcat7 tomcat7 2.0M Nov 18 23:59 dspace.log.2015-11-18 -rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo -rw-rw-r-- 1 tomcat7 tomcat7 169K Nov 18 23:59 dspace.log.2015-11-18.xz -</code></pre> +</code></pre></li> +</ul> @@ -1007,12 +1009,13 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and <ul> <li>CGSpace went down</li> <li>Looks like DSpace exhausted its PostgreSQL connection pool</li> -<li>Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:</li> -</ul> + +<li><p>Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:</p> <pre><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace 78 -</code></pre> +</code></pre></li> +</ul> diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html index 45bad7fa4..0c2a02c46 100644 --- a/docs/posts/page/2/index.html +++ b/docs/posts/page/2/index.html @@ -15,7 +15,7 @@ - + @@ -101,18 +101,16 @@

    2018-07-01

    +
  • I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:

    $ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
    -
    +
  • - +
  • During the mvn package stage on the 5.8 branch I kept getting issues with java running out of memory:

    There is insufficient memory for the Java Runtime Environment to continue.
    -
    +
  • + Read more → @@ -139,23 +137,23 @@
  • There seems to be a problem with the CUA and L&R versions in pom.xml because they are using SNAPSHOT and it doesn’t build
  • I added the new CCAFS Phase II Project Tag PII-FP1_PACCA2 and merged it into the 5_x-prod branch (#379)
  • -
  • I proofed and tested the ILRI author corrections that Peter sent back to me this week:
  • - + +
  • I proofed and tested the ILRI author corrections that Peter sent back to me this week:

    $ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
    -
    +
  • - +
  • I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in March, 2018

  • + +
  • Time to index ~70,000 items on CGSpace:

    $ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b                                  
     
     real    74m42.646s
     user    8m5.056s
     sys     2m7.289s
    -
    +
  • + Read more → @@ -279,26 +277,25 @@ sys 2m7.289s
  • I didn’t get any load alerts from Linode and the REST and XMLUI logs don’t show anything out of the ordinary
  • The nginx logs show HTTP 200s until 02/Jan/2018:11:27:17 +0000 when Uptime Robot got an HTTP 500
  • In dspace.log around that time I see many errors like “Client closed the connection before file download was complete”
  • -
  • And just before that I see this:
  • - + +
  • And just before that I see this:

    Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
    -
    +
  • - +
  • Ah hah! So the pool was actually empty!

  • + +
  • I need to increase that, let’s try to bump it up from 50 to 75

  • + +
  • After that one client got an HTTP 499 but then the rest were HTTP 200, so I don’t know what the hell Uptime Robot saw

  • + +
  • I notice this error quite a few times in dspace.log:

    2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
     org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered " "]" "] "" at line 1, column 32.
    -
    +
  • - +
  • And there are many of these errors every day for the past month:

    $ grep -c "Error while searching for sidebar facets" dspace.log.*
     dspace.log.2017-11-21:4
    @@ -344,10 +341,9 @@ dspace.log.2017-12-30:89
     dspace.log.2017-12-31:53
     dspace.log.2018-01-01:45
     dspace.log.2018-01-02:34
    -
    +
  • - Read more → @@ -400,20 +396,18 @@ dspace.log.2018-01-02:34

    2017-11-02

    +
  • Today there have been no hits by CORE and no alerts from Linode (coincidence?)

    # grep -c "CORE" /var/log/nginx/access.log
     0
    -
    +
  • - +
  • Generate list of authors on CGSpace for Peter to go through and correct:

    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
     COPY 54701
    -
    +
  • + Read more → @@ -434,15 +428,14 @@ COPY 54701

    2017-10-01

    +
  • Peter emailed to point out that many items in the ILRI archive collection have multiple handles:

    http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
    -
    +
  • - Read more → diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html index fb5bfc83e..85fb5553a 100644 --- a/docs/posts/page/3/index.html +++ b/docs/posts/page/3/index.html @@ -15,7 +15,7 @@ - + @@ -261,11 +261,12 @@ + +
  • Testing the CMYK patch on a collection with 650 items:

    $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt
    -
    +
  • + Read more → @@ -300,12 +301,13 @@
  • Also, need to consider talking to Atmire about hiring them to bring ORCiD metadata to REST / OAI
  • Filed an issue on DSpace issue tracker for the filter-media bug that causes it to process JPGs even when limiting to the PDF thumbnail plugin: DS-3516
  • Discovered that the ImageMagick filter-media plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK
  • -
  • Interestingly, it seems DSpace 4.x’s thumbnails were sRGB, but forcing regeneration using DSpace 5.x’s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 1056851999):
  • - + +
  • Interestingly, it seems DSpace 4.x’s thumbnails were sRGB, but forcing regeneration using DSpace 5.x’s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 1056851999):

    $ identify ~/Desktop/alc_contrastes_desafios.jpg
     /Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
    -
    +
  • + Read more → @@ -326,23 +328,22 @@

    2017-02-07

    +
  • An item was mapped twice erroneously again, so I had to remove one of the mappings manually:

    dspace=# select * from collection2item where item_id = '80278';
    -  id   | collection_id | item_id
    +id   | collection_id | item_id
     -------+---------------+---------
    - 92551 |           313 |   80278
    - 92550 |           313 |   80278
    - 90774 |          1051 |   80278
    +92551 |           313 |   80278
    +92550 |           313 |   80278
    +90774 |          1051 |   80278
     (3 rows)
     dspace=# delete from collection2item where id = 92551 and item_id = 80278;
     DELETE 1
    -
    +
  • - Read more → diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html index fc520b30a..4386e0d2a 100644 --- a/docs/posts/page/4/index.html +++ b/docs/posts/page/4/index.html @@ -15,7 +15,7 @@ - + @@ -102,20 +102,21 @@ + +
  • While looking in the logs for errors, I see tons of warnings about Atmire MQM:

    2016-12-02 03:00:32,352 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail="dc.title", transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail="THUMBNAIL", transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
    -
    +
  • - Read more → @@ -168,11 +169,12 @@
  • ORCIDs only
  • ORCIDs plus normal authors
  • -
  • I exported a random item’s metadata as CSV, deleted all columns except id and collection, and made a new column called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:
  • - + +
  • I exported a random item’s metadata as CSV, deleted all columns except id and collection, and made a new column called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:

    0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
    -
    +
  • + Read more → @@ -196,11 +198,12 @@
  • Discuss helping CCAFS with some batch tagging of ORCID IDs for their authors
  • Discuss how the migration of CGIAR’s Active Directory to a flat structure will break our LDAP groups in DSpace
  • We had been using DC=ILRI to determine whether a user was ILRI or not
  • -
  • It looks like we might be able to use OUs now, instead of DCs:
  • - + +
  • It looks like we might be able to use OUs now, instead of DCs:

    $ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=org" -D "admigration1@cgiarad.org" -W "(sAMAccountName=admigration1)"
    -
    +
  • + Read more → @@ -226,13 +229,14 @@
  • Bootstrap is at 3.3.0 but upstream is at 3.3.7, and upgrading to anything beyond 3.3.1 breaks glyphicons and probably more
  • bower stuff is a dead end, waste of time, too many issues
  • Anything after Bootstrap 3.3.1 makes glyphicons disappear (HTTP 404 trying to access from incorrect path of fonts)
  • -
  • Start working on DSpace 5.1 → 5.5 port:
  • - + +
  • Start working on DSpace 5.1 → 5.5 port:

    $ git checkout -b 55new 5_x-prod
     $ git reset --hard ilri/5_x-prod
     $ git rebase -i dspace-5.5
    -
    +
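  • When the interactive rebase stops on conflicts, the usual loop is to inspect, fix, stage, and continue (or abort and start over):

    $ git status                  # see which files conflict
    $ git add [conflicted files]  # after fixing them
    $ git rebase --continue
    $ git rebase --abort          # or give up and return to the starting point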
  • + Read more → @@ -254,19 +258,18 @@ $ git rebase -i dspace-5.5 + +
  • I think this query should find and replace all authors that have “,” at the end of their names:

    dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
     UPDATE 95
     dspacetest=# select text_value from  metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
    - text_value
    +text_value
     ------------
     (0 rows)
    -
    +
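  • To be safe, the same update can be tested inside a transaction first and only committed once the affected row count looks right:

    dspacetest=# BEGIN;
    dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
    dspacetest=# ROLLBACK; -- or COMMIT; once the UPDATE count matches the earlier SELECT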
  • - Read more → @@ -317,12 +320,13 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and + +
  • There are 3,000 IPs accessing the REST API in a 24-hour period!

    # awk '{print $1}' /var/log/nginx/rest.log  | uniq | wc -l
     3168
    -
    +
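  • Note that uniq only collapses adjacent duplicate lines, so without a sort the number above overcounts; sorting first gives the true number of distinct IPs:

    # awk '{print $1}' /var/log/nginx/rest.log | sort -u | wc -l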
  • + Read more → diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html index 1bca620b8..a531312a9 100644 --- a/docs/posts/page/5/index.html +++ b/docs/posts/page/5/index.html @@ -15,7 +15,7 @@ - + @@ -156,15 +156,15 @@

    2015-12-02

    +
  • Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less space:

    # cd /home/dspacetest.cgiar.org/log
     # ls -lh dspace.log.2015-11-18*
     -rw-rw-r-- 1 tomcat7 tomcat7 2.0M Nov 18 23:59 dspace.log.2015-11-18
     -rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo
     -rw-rw-r-- 1 tomcat7 tomcat7 169K Nov 18 23:59 dspace.log.2015-11-18.xz
    -
    +
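  • A sketch of the kind of cron entry this implies, compressing the previous days' logs with xz — the log path matches the listing above, and the schedule is arbitrary:

    0 1 * * * find /home/dspacetest.cgiar.org/log -name "dspace.log.2*" ! -name "*.lzo" ! -name "*.xz" -mtime +1 -exec xz {} \;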
  • + Read more → @@ -187,12 +187,13 @@ + +
  • Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:

    $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
     78
    -
    +
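  • A more structured view of the same thing, grouping connections by database and state instead of grepping (assuming PostgreSQL 9.2+, where pg_stat_activity has a state column):

    $ psql -c 'SELECT datname, state, count(*) FROM pg_stat_activity GROUP BY datname, state ORDER BY 3 DESC;'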
  • + Read more → diff --git a/docs/sitemap.xml b/docs/sitemap.xml index 152625235..f27c52f71 100644 --- a/docs/sitemap.xml +++ b/docs/sitemap.xml @@ -4,30 +4,30 @@ https://alanorth.github.io/cgspace-notes/ - 2019-05-03T10:29:01+03:00 + 2019-05-03T16:33:34+03:00 0 https://alanorth.github.io/cgspace-notes/2019-05/ - 2019-05-03T10:29:01+03:00 + 2019-05-03T16:33:34+03:00 https://alanorth.github.io/cgspace-notes/tags/notes/ - 2019-05-03T10:29:01+03:00 + 2019-05-03T16:33:34+03:00 0 https://alanorth.github.io/cgspace-notes/posts/ - 2019-05-03T10:29:01+03:00 + 2019-05-03T16:33:34+03:00 0 https://alanorth.github.io/cgspace-notes/tags/ - 2019-05-03T10:29:01+03:00 + 2019-05-03T16:33:34+03:00 0 diff --git a/docs/tags/index.html b/docs/tags/index.html index 1b66e195e..8d00d282f 100644 --- a/docs/tags/index.html +++ b/docs/tags/index.html @@ -15,7 +15,7 @@ - + @@ -108,15 +108,14 @@
  • Apparently if the item is in the workflowitem table it is submitted to a workflow
  • And if it is in the workspaceitem table it is in the pre-submitted state
  • -
  • The item seems to be in a pre-submitted state, so I tried to delete it from there:
  • - + +
  • The item seems to be in a pre-submitted state, so I tried to delete it from there:

    dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
     DELETE 1
    -
    +
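  • Before deleting, it is worth confirming which of the two tables actually holds the stuck item, something like:

    dspace=# SELECT * FROM workspaceitem WHERE item_id=74648;
    dspace=# SELECT * FROM workflowitem WHERE item_id=74648;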
  • - Read more → @@ -143,27 +142,27 @@ DELETE 1 -
  • There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today + +
  • There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today

  • - +
  • I suspected that some might not be successful, because the stats show fewer downloads, but today they were all HTTP 200!

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
    -   4432 200
    -
    +4432 200 +
  • + - +
  • In the last two weeks there have been 47,000 downloads of this same exact PDF by these three IP addresses

  • + +
  • Apply country and region corrections and deletions on DSpace Test and CGSpace:

    $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
     $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
     $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
     $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
    -
    +
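  • Before running corrections like these it is cheap to snapshot the affected fields first, following the same \copy idiom used elsewhere in these notes — a sketch assuming the DSpace 5 metadatavalue schema with resource_id, where field IDs 228 and 231 are the country and region fields referenced above:

    dspace=# \copy (SELECT resource_id, text_value FROM metadatavalue WHERE metadata_field_id IN (228, 231) AND resource_type_id=2) to /tmp/2019-02-21-country-region-backup.csv with csv;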
  • + Read more → @@ -218,27 +217,27 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace + +
  • The top IPs before, during, and after this latest alert tonight were:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "01/Feb/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -    245 207.46.13.5
    -    332 54.70.40.11
    -    385 5.143.231.38
    -    405 207.46.13.173
    -    405 207.46.13.75
    -   1117 66.249.66.219
    -   1121 35.237.175.180
    -   1546 5.9.6.51
    -   2474 45.5.186.2
    -   5490 85.25.237.71
    -
    +245 207.46.13.5 +332 54.70.40.11 +385 5.143.231.38 +405 207.46.13.173 +405 207.46.13.75 +1117 66.249.66.219 +1121 35.237.175.180 +1546 5.9.6.51 +2474 45.5.186.2 +5490 85.25.237.71 +
  • - +
  • 85.25.237.71 is the “Linguee Bot” that I first saw last month

  • + +
  • The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase

  • + +
  • There were just over 3 million accesses in the nginx logs last month:

    # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
     3018243
    @@ -246,7 +245,8 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
     real    0m19.873s
     user    0m22.203s
     sys     0m1.979s
    -
    +
  • + Read more → @@ -268,21 +268,22 @@ sys 0m1.979s + +
  • I don’t see anything interesting in the web server logs around that time though:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -     92 40.77.167.4
    -     99 210.7.29.100
    -    120 38.126.157.45
    -    177 35.237.175.180
    -    177 40.77.167.32
    -    216 66.249.75.219
    -    225 18.203.76.93
    -    261 46.101.86.248
    -    357 207.46.13.1
    -    903 54.70.40.11
    -
    + 92 40.77.167.4 + 99 210.7.29.100 +120 38.126.157.45 +177 35.237.175.180 +177 40.77.167.32 +216 66.249.75.219 +225 18.203.76.93 +261 46.101.86.248 +357 207.46.13.1 +903 54.70.40.11 +
  • + Read more → @@ -411,21 +412,24 @@ sys 0m1.979s

    2018-08-01

    +
  • DSpace Test had crashed at some point yesterday morning and I see the following in dmesg:

    [Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
     [Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB
     [Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    -
    +
  • - Read more → diff --git a/docs/tags/notes/index.html b/docs/tags/notes/index.html index eda4494c2..996ccd0e0 100644 --- a/docs/tags/notes/index.html +++ b/docs/tags/notes/index.html @@ -15,7 +15,7 @@ - + @@ -93,15 +93,14 @@
  • Apparently if the item is in the workflowitem table it is submitted to a workflow
  • And if it is in the workspaceitem table it is in the pre-submitted state
  • -
  • The item seems to be in a pre-submitted state, so I tried to delete it from there:
  • - + +
  • The item seems to be in a pre-submitted state, so I tried to delete it from there:

    dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
     DELETE 1
    -
    +
  • - Read more → @@ -128,27 +127,27 @@ DELETE 1 -
  • There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today + +
  • There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today

  • - +
  • I suspected that some might not be successful, because the stats show fewer downloads, but today they were all HTTP 200!

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
    -   4432 200
    -
    +4432 200 +
  • + - +
  • In the last two weeks there have been 47,000 downloads of this same exact PDF by these three IP addresses

  • + +
  • Apply country and region corrections and deletions on DSpace Test and CGSpace:

    $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
     $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
     $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
     $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
    -
    +
  • + Read more → @@ -203,27 +202,27 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace + +
  • The top IPs before, during, and after this latest alert tonight were:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "01/Feb/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -    245 207.46.13.5
    -    332 54.70.40.11
    -    385 5.143.231.38
    -    405 207.46.13.173
    -    405 207.46.13.75
    -   1117 66.249.66.219
    -   1121 35.237.175.180
    -   1546 5.9.6.51
    -   2474 45.5.186.2
    -   5490 85.25.237.71
    -
    +245 207.46.13.5 +332 54.70.40.11 +385 5.143.231.38 +405 207.46.13.173 +405 207.46.13.75 +1117 66.249.66.219 +1121 35.237.175.180 +1546 5.9.6.51 +2474 45.5.186.2 +5490 85.25.237.71 +
  • - +
  • 85.25.237.71 is the “Linguee Bot” that I first saw last month

  • + +
  • The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase

  • + +
  • There were just over 3 million accesses in the nginx logs last month:

    # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
     3018243
    @@ -231,7 +230,8 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
     real    0m19.873s
     user    0m22.203s
     sys     0m1.979s
    -
    +
  • + Read more → @@ -253,21 +253,22 @@ sys 0m1.979s + +
  • I don’t see anything interesting in the web server logs around that time though:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -     92 40.77.167.4
    -     99 210.7.29.100
    -    120 38.126.157.45
    -    177 35.237.175.180
    -    177 40.77.167.32
    -    216 66.249.75.219
    -    225 18.203.76.93
    -    261 46.101.86.248
    -    357 207.46.13.1
    -    903 54.70.40.11
    -
    + 92 40.77.167.4 + 99 210.7.29.100 +120 38.126.157.45 +177 35.237.175.180 +177 40.77.167.32 +216 66.249.75.219 +225 18.203.76.93 +261 46.101.86.248 +357 207.46.13.1 +903 54.70.40.11 +
  • + Read more → @@ -396,21 +397,24 @@ sys 0m1.979s

    2018-08-01

    +
  • DSpace Test had crashed at some point yesterday morning and I see the following in dmesg:

    [Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
     [Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB
     [Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    -
    +
  • - Read more → diff --git a/docs/tags/notes/index.xml b/docs/tags/notes/index.xml index 4a7a47e13..9e3eac282 100644 --- a/docs/tags/notes/index.xml +++ b/docs/tags/notes/index.xml @@ -27,15 +27,14 @@ <li>Apparently if the item is in the <code>workflowitem</code> table it is submitted to a workflow</li> <li>And if it is in the <code>workspaceitem</code> table it is in the pre-submitted state</li> </ul></li> -<li>The item seems to be in a pre-submitted state, so I tried to delete it from there:</li> -</ul> + +<li><p>The item seems to be in a pre-submitted state, so I tried to delete it from there:</p> <pre><code>dspace=# DELETE FROM workspaceitem WHERE item_id=74648; DELETE 1 -</code></pre> +</code></pre></li> -<ul> -<li>But after this I tried to delete the item from the XMLUI and it is <em>still</em> present&hellip;</li> +<li><p>But after this I tried to delete the item from the XMLUI and it is <em>still</em> present&hellip;</p></li> </ul> @@ -53,27 +52,27 @@ DELETE 1 <ul> <li>They asked if we had plans to enable RDF support in CGSpace</li> </ul></li> -<li>There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today + +<li><p>There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today</p> <ul> -<li>I suspected that some might not be successful, because the stats show less, but today they were all HTTP 200!</li> -</ul></li> -</ul> +<li><p>I suspected that some might not be successful, because the stats show less, but today they were all HTTP 200!</p> <pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5 - 4432 200 -</code></pre> +4432 200 +</code></pre></li> +</ul></li> -<ul> -<li>In the last two weeks there have been 47,000 downloads of this <em>same exact PDF</em> by these three IP addresses</li> -<li>Apply country and region corrections and deletions on DSpace Test and CGSpace:</li> -</ul> +<li><p>In the last two weeks there have been 47,000 downloads of this <em>same exact PDF</em> by these three IP addresses</p></li> + +<li><p>Apply country and region corrections and deletions on DSpace Test and CGSpace:</p> <pre><code>$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d -</code></pre> +</code></pre></li> +</ul> @@ -110,27 +109,27 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace <ul> <li>Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!</li> -<li>The top IPs before, during, and after this latest alert tonight were:</li> -</ul> + +<li><p>The top IPs before, during, and after this latest alert tonight were:</p> <pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;01/Feb/2019:(17|18|19|20|21)&quot; | awk '{print $1}' | sort | uniq 
-c | sort -n | tail -n 10 - 245 207.46.13.5 - 332 54.70.40.11 - 385 5.143.231.38 - 405 207.46.13.173 - 405 207.46.13.75 - 1117 66.249.66.219 - 1121 35.237.175.180 - 1546 5.9.6.51 - 2474 45.5.186.2 - 5490 85.25.237.71 -</code></pre> +245 207.46.13.5 +332 54.70.40.11 +385 5.143.231.38 +405 207.46.13.173 +405 207.46.13.75 +1117 66.249.66.219 +1121 35.237.175.180 +1546 5.9.6.51 +2474 45.5.186.2 +5490 85.25.237.71 +</code></pre></li> -<ul> -<li><code>85.25.237.71</code> is the &ldquo;Linguee Bot&rdquo; that I first saw last month</li> -<li>The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase</li> -<li>There were just over 3 million accesses in the nginx logs last month:</li> -</ul> +<li><p><code>85.25.237.71</code> is the &ldquo;Linguee Bot&rdquo; that I first saw last month</p></li> + +<li><p>The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase</p></li> + +<li><p>There were just over 3 million accesses in the nginx logs last month:</p> <pre><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2019&quot; 3018243 @@ -138,7 +137,8 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace real 0m19.873s user 0m22.203s sys 0m1.979s -</code></pre> +</code></pre></li> +</ul> @@ -151,21 +151,22 @@ sys 0m1.979s <ul> <li>Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning</li> -<li>I don&rsquo;t see anything interesting in the web server logs around that time though:</li> -</ul> + +<li><p>I don&rsquo;t see anything interesting in the web server logs around that time though:</p> <pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 - 92 40.77.167.4 - 99 210.7.29.100 - 120 38.126.157.45 - 177 35.237.175.180 - 177 40.77.167.32 - 216 66.249.75.219 - 225 18.203.76.93 - 261 46.101.86.248 - 357 207.46.13.1 - 903 54.70.40.11 -</code></pre> + 92 40.77.167.4 + 99 210.7.29.100 +120 38.126.157.45 +177 35.237.175.180 +177 40.77.167.32 +216 66.249.75.219 +225 18.203.76.93 +261 46.101.86.248 +357 207.46.13.1 +903 54.70.40.11 +</code></pre></li> +</ul> @@ -249,21 +250,24 @@ sys 0m1.979s <h2 id="2018-08-01">2018-08-01</h2> <ul> -<li>DSpace Test had crashed at some point yesterday morning and I see the following in <code>dmesg</code>:</li> -</ul> +<li><p>DSpace Test had crashed at some point yesterday morning and I see the following in <code>dmesg</code>:</p> <pre><code>[Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child [Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB [Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB -</code></pre> +</code></pre></li> -<ul> -<li>Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight</li> -<li>From the DSpace log I see that eventually Solr stopped responding, so I guess the <code>java</code> process that was OOM killed above was Tomcat&rsquo;s</li> -<li>I&rsquo;m not sure why Tomcat didn&rsquo;t crash with an OutOfMemoryError&hellip;</li> -<li>Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core</li> 
-<li>The server only has 8GB of RAM so we&rsquo;ll eventually need to upgrade to a larger one because we&rsquo;ll start starving the OS, PostgreSQL, and command line batch processes</li> -<li>I ran all system updates on DSpace Test and rebooted it</li> +<li><p>Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight</p></li> + +<li><p>From the DSpace log I see that eventually Solr stopped responding, so I guess the <code>java</code> process that was OOM killed above was Tomcat&rsquo;s</p></li> + +<li><p>I&rsquo;m not sure why Tomcat didn&rsquo;t crash with an OutOfMemoryError&hellip;</p></li> + +<li><p>Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core</p></li> + +<li><p>The server only has 8GB of RAM so we&rsquo;ll eventually need to upgrade to a larger one because we&rsquo;ll start starving the OS, PostgreSQL, and command line batch processes</p></li> + +<li><p>I ran all system updates on DSpace Test and rebooted it</p></li> </ul> @@ -276,18 +280,16 @@ sys 0m1.979s <h2 id="2018-07-01">2018-07-01</h2> <ul> -<li>I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:</li> -</ul> +<li><p>I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:</p> <pre><code>$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace -</code></pre> +</code></pre></li> -<ul> -<li>During the <code>mvn package</code> stage on the 5.8 branch I kept getting issues with java running out of memory:</li> -</ul> +<li><p>During the <code>mvn package</code> stage on the 5.8 branch I kept getting issues with java running out of memory:</p> <pre><code>There is insufficient memory for the Java Runtime Environment to continue. 
-</code></pre> +</code></pre></li> +</ul> @@ -305,23 +307,23 @@ sys 0m1.979s <li>There seems to be a problem with the CUA and L&amp;R versions in <code>pom.xml</code> because they are using SNAPSHOT and it doesn&rsquo;t build</li> </ul></li> <li>I added the new CCAFS Phase II Project Tag <code>PII-FP1_PACCA2</code> and merged it into the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/379">#379</a>)</li> -<li>I proofed and tested the ILRI author corrections that Peter sent back to me this week:</li> -</ul> + +<li><p>I proofed and tested the ILRI author corrections that Peter sent back to me this week:</p> <pre><code>$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n -</code></pre> +</code></pre></li> -<ul> -<li>I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in <a href="https://alanorth.github.io/cgspace-notes/cgspace-notes/2018-03/">March, 2018</a></li> -<li>Time to index ~70,000 items on CGSpace:</li> -</ul> +<li><p>I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in <a href="https://alanorth.github.io/cgspace-notes/cgspace-notes/2018-03/">March, 2018</a></p></li> + +<li><p>Time to index ~70,000 items on CGSpace:</p> <pre><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b real 74m42.646s user 8m5.056s sys 2m7.289s -</code></pre> +</code></pre></li> +</ul> @@ -400,26 +402,25 @@ sys 2m7.289s <li>I didn&rsquo;t get any load alerts from Linode and the REST and XMLUI logs don&rsquo;t show anything out of the ordinary</li> <li>The nginx logs show HTTP 200s until <code>02/Jan/2018:11:27:17 +0000</code> when Uptime Robot got an HTTP 500</li> <li>In dspace.log around that time I see many errors like &ldquo;Client closed the connection before file download was complete&rdquo;</li> -<li>And just before that I see this:</li> -</ul> + +<li><p>And just before that I see this:</p> <pre><code>Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000]. -</code></pre> +</code></pre></li> -<ul> -<li>Ah hah! So the pool was actually empty!</li> -<li>I need to increase that, let&rsquo;s try to bump it up from 50 to 75</li> -<li>After that one client got an HTTP 499 but then the rest were HTTP 200, so I don&rsquo;t know what the hell Uptime Robot saw</li> -<li>I notice this error quite a few times in dspace.log:</li> -</ul> +<li><p>Ah hah! So the pool was actually empty!</p></li> + +<li><p>I need to increase that, let&rsquo;s try to bump it up from 50 to 75</p></li> + +<li><p>After that one client got an HTTP 499 but then the rest were HTTP 200, so I don&rsquo;t know what the hell Uptime Robot saw</p></li> + +<li><p>I notice this error quite a few times in dspace.log:</p> <pre><code>2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered &quot; &quot;]&quot; &quot;] &quot;&quot; at line 1, column 32. 
-</code></pre> +</code></pre></li> -<ul> -<li>And there are many of these errors every day for the past month:</li> -</ul> +<li><p>And there are many of these errors every day for the past month:</p> <pre><code>$ grep -c &quot;Error while searching for sidebar facets&quot; dspace.log.* dspace.log.2017-11-21:4 @@ -465,10 +466,9 @@ dspace.log.2017-12-30:89 dspace.log.2017-12-31:53 dspace.log.2018-01-01:45 dspace.log.2018-01-02:34 -</code></pre> +</code></pre></li> -<ul> -<li>Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let&rsquo;s Encrypt if it&rsquo;s just a handful of domains</li> +<li><p>Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let&rsquo;s Encrypt if it&rsquo;s just a handful of domains</p></li> </ul> @@ -503,20 +503,18 @@ dspace.log.2018-01-02:34 <h2 id="2017-11-02">2017-11-02</h2> <ul> -<li>Today there have been no hits by CORE and no alerts from Linode (coincidence?)</li> -</ul> +<li><p>Today there have been no hits by CORE and no alerts from Linode (coincidence?)</p> <pre><code># grep -c &quot;CORE&quot; /var/log/nginx/access.log 0 -</code></pre> +</code></pre></li> -<ul> -<li>Generate list of authors on CGSpace for Peter to go through and correct:</li> -</ul> +<li><p>Generate list of authors on CGSpace for Peter to go through and correct:</p> <pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv; COPY 54701 -</code></pre> +</code></pre></li> +</ul> @@ -528,15 +526,14 @@ COPY 54701 <h2 id="2017-10-01">2017-10-01</h2> <ul> -<li>Peter emailed to point out that many items in the <a href="https://cgspace.cgiar.org/handle/10568/2703">ILRI archive collection</a> have multiple handles:</li> -</ul> +<li><p>Peter emailed to point out that many items in the <a href="https://cgspace.cgiar.org/handle/10568/2703">ILRI archive collection</a> have multiple handles:</p> <pre><code>http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336 -</code></pre> +</code></pre></li> -<ul> -<li>There appears to be a pattern but I&rsquo;ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine</li> -<li>Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections</li> +<li><p>There appears to be a pattern but I&rsquo;ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine</p></li> + +<li><p>Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections</p></li> </ul> @@ -646,11 +643,12 @@ COPY 54701 <ul> <li>Remove redundant/duplicate text in the DSpace submission license</li> -<li>Testing the CMYK patch on a collection with 650 items:</li> -</ul> + +<li><p>Testing the CMYK patch on a collection with 650 items:</p> <pre><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &quot;ImageMagick PDF Thumbnail&quot; -v &gt;&amp; /tmp/filter-media-cmyk.txt -</code></pre> +</code></pre></li> +</ul> @@ -676,12 +674,13 @@ COPY 54701 <li>Also, need to consider talking to Atmire about hiring them to bring ORCiD metadata to REST / OAI</li> <li>Filed an issue on DSpace issue tracker for the <code>filter-media</code> bug that causes it to 
process JPGs even when limiting to the PDF thumbnail plugin: <a href="https://jira.duraspace.org/browse/DS-3516">DS-3516</a></li> <li>Discovered that the ImageMagic <code>filter-media</code> plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK</li> -<li>Interestingly, it seems DSpace 4.x&rsquo;s thumbnails were sRGB, but forcing regeneration using DSpace 5.x&rsquo;s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see <a href="https://cgspace.cgiar.org/handle/10568/51999"><sup>10568</sup>&frasl;<sub>51999</sub></a>):</li> -</ul> + +<li><p>Interestingly, it seems DSpace 4.x&rsquo;s thumbnails were sRGB, but forcing regeneration using DSpace 5.x&rsquo;s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see <a href="https://cgspace.cgiar.org/handle/10568/51999"><sup>10568</sup>&frasl;<sub>51999</sub></a>):</p> <pre><code>$ identify ~/Desktop/alc_contrastes_desafios.jpg /Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000 -</code></pre> +</code></pre></li> +</ul> @@ -693,23 +692,22 @@ COPY 54701 <h2 id="2017-02-07">2017-02-07</h2> <ul> -<li>An item was mapped twice erroneously again, so I had to remove one of the mappings manually:</li> -</ul> +<li><p>An item was mapped twice erroneously again, so I had to remove one of the mappings manually:</p> <pre><code>dspace=# select * from collection2item where item_id = '80278'; - id | collection_id | item_id +id | collection_id | item_id -------+---------------+--------- - 92551 | 313 | 80278 - 92550 | 313 | 80278 - 90774 | 1051 | 80278 +92551 | 313 | 80278 +92550 | 313 | 80278 +90774 | 1051 | 80278 (3 rows) dspace=# delete from collection2item where id = 92551 and item_id = 80278; DELETE 1 -</code></pre> +</code></pre></li> -<ul> -<li>Create issue on GitHub to track the addition of CCAFS Phase II project tags (<a href="https://github.com/ilri/DSpace/issues/301">#301</a>)</li> -<li>Looks like we&rsquo;ll be using <code>cg.identifier.ccafsprojectpii</code> as the field name</li> +<li><p>Create issue on GitHub to track the addition of CCAFS Phase II project tags (<a href="https://github.com/ilri/DSpace/issues/301">#301</a>)</p></li> + +<li><p>Looks like we&rsquo;ll be using <code>cg.identifier.ccafsprojectpii</code> as the field name</p></li> </ul> @@ -738,20 +736,21 @@ DELETE 1 <ul> <li>CGSpace was down for five hours in the morning while I was sleeping</li> -<li>While looking in the logs for errors, I see tons of warnings about Atmire MQM:</li> -</ul> + +<li><p>While looking in the logs for errors, I see tons of warnings about Atmire MQM:</p> <pre><code>2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;) 2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail=&quot;dc.title&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;) 2016-12-02 03:00:32,353 WARN 
com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail=&quot;THUMBNAIL&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;) 2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail=&quot;-1&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;) 2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;) -</code></pre> +</code></pre></li> -<ul> -<li>I see thousands of them in the logs for the last few months, so it&rsquo;s not related to the DSpace 5.5 upgrade</li> -<li>I&rsquo;ve raised a ticket with Atmire to ask</li> -<li>Another worrying error from dspace.log is:</li> +<li><p>I see thousands of them in the logs for the last few months, so it&rsquo;s not related to the DSpace 5.5 upgrade</p></li> + +<li><p>I&rsquo;ve raised a ticket with Atmire to ask</p></li> + +<li><p>Another worrying error from dspace.log is:</p></li> </ul> @@ -786,11 +785,12 @@ DELETE 1 <li>ORCIDs only</li> <li>ORCIDs plus normal authors</li> </ul></li> -<li>I exported a random item&rsquo;s metadata as CSV, deleted <em>all columns</em> except id and collection, and made a new coloum called <code>ORCID:dc.contributor.author</code> with the following random ORCIDs from the ORCID registry:</li> -</ul> + +<li><p>I exported a random item&rsquo;s metadata as CSV, deleted <em>all columns</em> except id and collection, and made a new coloum called <code>ORCID:dc.contributor.author</code> with the following random ORCIDs from the ORCID registry:</p> <pre><code>0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X -</code></pre> +</code></pre></li> +</ul> @@ -805,11 +805,12 @@ DELETE 1 <li>Discuss helping CCAFS with some batch tagging of ORCID IDs for their authors</li> <li>Discuss how the migration of CGIAR&rsquo;s Active Directory to a flat structure will break our LDAP groups in DSpace</li> <li>We had been using <code>DC=ILRI</code> to determine whether a user was ILRI or not</li> -<li>It looks like we might be able to use OUs now, instead of DCs:</li> -</ul> + +<li><p>It looks like we might be able to use OUs now, instead of DCs:</p> <pre><code>$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &quot;dc=cgiarad,dc=org&quot; -D &quot;admigration1@cgiarad.org&quot; -W &quot;(sAMAccountName=admigration1)&quot; -</code></pre> +</code></pre></li> +</ul> @@ -826,13 +827,14 @@ DELETE 1 <li>Bootstrap is at 3.3.0 but upstream is at 3.3.7, and upgrading to anything beyond 3.3.1 breaks glyphicons and probably more</li> <li>bower stuff is a dead end, waste of time, too many issues</li> <li>Anything after Bootstrap 3.3.1 makes glyphicons disappear (HTTP 404 trying to access from incorrect path 
of <code>fonts</code>)</li> -<li>Start working on DSpace 5.1 → 5.5 port:</li> -</ul> + +<li><p>Start working on DSpace 5.1 → 5.5 port:</p> <pre><code>$ git checkout -b 55new 5_x-prod $ git reset --hard ilri/5_x-prod $ git rebase -i dspace-5.5 -</code></pre> +</code></pre></li> +</ul> @@ -845,19 +847,18 @@ $ git rebase -i dspace-5.5 <ul> <li>Add <code>dc.description.sponsorship</code> to Discovery sidebar facets and make investors clickable in item view (<a href="https://github.com/ilri/DSpace/issues/232">#232</a>)</li> -<li>I think this query should find and replace all authors that have &ldquo;,&rdquo; at the end of their names:</li> -</ul> + +<li><p>I think this query should find and replace all authors that have &ldquo;,&rdquo; at the end of their names:</p> <pre><code>dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$'; UPDATE 95 dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$'; - text_value +text_value ------------ (0 rows) -</code></pre> +</code></pre></li> -<ul> -<li>In this case the select query was showing 95 results before the update</li> +<li><p>In this case the select query was showing 95 results before the update</p></li> </ul> @@ -890,12 +891,13 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and <ul> <li>Since yesterday there have been 10,000 REST errors and the site has been unstable again</li> <li>I have blocked access to the API now</li> -<li>There are 3,000 IPs accessing the REST API in a 24-hour period!</li> -</ul> + +<li><p>There are 3,000 IPs accessing the REST API in a 24-hour period!</p> <pre><code># awk '{print $1}' /var/log/nginx/rest.log | uniq | wc -l 3168 -</code></pre> +</code></pre></li> +</ul> @@ -976,15 +978,15 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and <h2 id="2015-12-02">2015-12-02</h2> <ul> -<li>Replace <code>lzop</code> with <code>xz</code> in log compression cron jobs on DSpace Test—it uses less space:</li> -</ul> +<li><p>Replace <code>lzop</code> with <code>xz</code> in log compression cron jobs on DSpace Test—it uses less space:</p> <pre><code># cd /home/dspacetest.cgiar.org/log # ls -lh dspace.log.2015-11-18* -rw-rw-r-- 1 tomcat7 tomcat7 2.0M Nov 18 23:59 dspace.log.2015-11-18 -rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo -rw-rw-r-- 1 tomcat7 tomcat7 169K Nov 18 23:59 dspace.log.2015-11-18.xz -</code></pre> +</code></pre></li> +</ul> @@ -998,12 +1000,13 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and <ul> <li>CGSpace went down</li> <li>Looks like DSpace exhausted its PostgreSQL connection pool</li> -<li>Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:</li> -</ul> + +<li><p>Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:</p> <pre><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace 78 -</code></pre> +</code></pre></li> +</ul> diff --git a/docs/tags/notes/page/2/index.html b/docs/tags/notes/page/2/index.html index 79448fad0..68d54eb3a 100644 --- a/docs/tags/notes/page/2/index.html +++ b/docs/tags/notes/page/2/index.html @@ -15,7 +15,7 @@ - + @@ -86,18 +86,16 @@

    2018-07-01

    +
  • I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:

    $ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
    -
    +
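  • The matching restore, should the upgrade go wrong, would be roughly this — a sketch; flags may need adjusting for the local setup:

    $ pg_restore -U dspace -d dspace --clean -O dspace-2018-07-01.backup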
  • - +
  • During the mvn package stage on the 5.8 branch I kept getting issues with java running out of memory:

    There is insufficient memory for the Java Runtime Environment to continue.
    -
    +
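  • The usual first step is to give Maven an explicit, larger JVM allocation via MAVEN_OPTS before retrying the build — the size here is just a guess and depends on what the machine can spare:

    $ export MAVEN_OPTS="-Xmx1024m"
    $ mvn -U clean package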
  • + Read more → @@ -124,23 +122,23 @@
  • There seems to be a problem with the CUA and L&R versions in pom.xml because they are using SNAPSHOT and it doesn’t build
  • I added the new CCAFS Phase II Project Tag PII-FP1_PACCA2 and merged it into the 5_x-prod branch (#379)
  • -
  • I proofed and tested the ILRI author corrections that Peter sent back to me this week:
  • - + +
  • I proofed and tested the ILRI author corrections that Peter sent back to me this week:

    $ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
    -
    +
  • - +
  • I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in March, 2018

  • + +
  • Time to index ~70,000 items on CGSpace:

    $ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b                                  
     
     real    74m42.646s
     user    8m5.056s
     sys     2m7.289s
    -
    +
  • + Read more → @@ -264,26 +262,25 @@ sys 2m7.289s
  • I didn’t get any load alerts from Linode and the REST and XMLUI logs don’t show anything out of the ordinary
  • The nginx logs show HTTP 200s until 02/Jan/2018:11:27:17 +0000 when Uptime Robot got an HTTP 500
  • In dspace.log around that time I see many errors like “Client closed the connection before file download was complete”
  • -
  • And just before that I see this:
  • - + +
  • And just before that I see this:

    Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
    -
    +
  • - +
  • Ah hah! So the pool was actually empty!

  • + +
  • I need to increase that, let’s try to bump it up from 50 to 75

  • + +
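  • Assuming the pool in question is DSpace's own database pool, the bump would be a one-line change to db.maxconnections in dspace.cfg followed by a Tomcat restart:

    db.maxconnections = 75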
  • After that one client got an HTTP 499 but then the rest were HTTP 200, so I don’t know what the hell Uptime Robot saw

  • + +
  • I notice this error quite a few times in dspace.log:

    2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
     org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered " "]" "] "" at line 1, column 32.
    -
    +
  • - +
  • And there are many of these errors every day for the past month:

    $ grep -c "Error while searching for sidebar facets" dspace.log.*
     dspace.log.2017-11-21:4
    @@ -329,10 +326,9 @@ dspace.log.2017-12-30:89
     dspace.log.2017-12-31:53
     dspace.log.2018-01-01:45
     dspace.log.2018-01-02:34
    -
    +
  • - Read more → @@ -385,20 +381,18 @@ dspace.log.2018-01-02:34

    2017-11-02

    +
  • Today there have been no hits by CORE and no alerts from Linode (coincidence?)

    # grep -c "CORE" /var/log/nginx/access.log
     0
    -
    +
  • - +
  • Generate list of authors on CGSpace for Peter to go through and correct:

    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
     COPY 54701
    -
    +
  • + Read more → @@ -419,15 +413,14 @@ COPY 54701

    2017-10-01

    +
  • Peter emailed to point out that many items in the ILRI archive collection have multiple handles:

    http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
    -
    +
  • - Read more → diff --git a/docs/tags/notes/page/3/index.html b/docs/tags/notes/page/3/index.html index c8e99f44a..bc7709c65 100644 --- a/docs/tags/notes/page/3/index.html +++ b/docs/tags/notes/page/3/index.html @@ -15,7 +15,7 @@ - + @@ -228,11 +228,12 @@ + +
  • Testing the CMYK patch on a collection with 650 items:

    $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt
    -
    +
  • + Read more → @@ -267,12 +268,13 @@
  • Also, need to consider talking to Atmire about hiring them to bring ORCiD metadata to REST / OAI
  • Filed an issue on DSpace issue tracker for the filter-media bug that causes it to process JPGs even when limiting to the PDF thumbnail plugin: DS-3516
  • Discovered that the ImageMagick filter-media plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK
  • -
  • Interestingly, it seems DSpace 4.x’s thumbnails were sRGB, but forcing regeneration using DSpace 5.x’s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 1056851999):
  • - + +
  • Interestingly, it seems DSpace 4.x’s thumbnails were sRGB, but forcing regeneration using DSpace 5.x’s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 1056851999):

    $ identify ~/Desktop/alc_contrastes_desafios.jpg
     /Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
    -
    +
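  • For comparison, converting such a thumbnail back to sRGB by hand is a one-liner with ImageMagick — just a sanity check, not the actual fix in filter-media:

    $ convert ~/Desktop/alc_contrastes_desafios.jpg -colorspace sRGB /tmp/alc_contrastes_desafios-srgb.jpg
    $ identify /tmp/alc_contrastes_desafios-srgb.jpg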
  • + Read more → @@ -293,23 +295,22 @@

    2017-02-07

    +
  • An item was mapped twice erroneously again, so I had to remove one of the mappings manually:

    dspace=# select * from collection2item where item_id = '80278';
    -  id   | collection_id | item_id
    +id   | collection_id | item_id
     -------+---------------+---------
    - 92551 |           313 |   80278
    - 92550 |           313 |   80278
    - 90774 |          1051 |   80278
    +92551 |           313 |   80278
    +92550 |           313 |   80278
    +90774 |          1051 |   80278
     (3 rows)
     dspace=# delete from collection2item where id = 92551 and item_id = 80278;
     DELETE 1
    -
    +
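  • A query like this should list every item that is mapped to the same collection more than once, so they can all be cleaned up in one go:

    dspace=# SELECT item_id, collection_id, COUNT(*) FROM collection2item GROUP BY item_id, collection_id HAVING COUNT(*) > 1;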
  • - Read more → @@ -356,20 +357,21 @@ DELETE 1 + +
  • While looking in the logs for errors, I see tons of warnings about Atmire MQM:

    2016-12-02 03:00:32,352 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail="dc.title", transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail="THUMBNAIL", transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
    -
    +
  • - Read more → diff --git a/docs/tags/notes/page/4/index.html b/docs/tags/notes/page/4/index.html index fa61974fb..6fe3bf1b2 100644 --- a/docs/tags/notes/page/4/index.html +++ b/docs/tags/notes/page/4/index.html @@ -15,7 +15,7 @@ - + @@ -117,11 +117,12 @@
  • ORCIDs only
  • ORCIDs plus normal authors
  • -
  • I exported a random item’s metadata as CSV, deleted all columns except id and collection, and made a new column called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:
  • - + +
  • I exported a random item’s metadata as CSV, deleted all columns except id and collection, and made a new column called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:

    0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
    -
    +
  • + Read more → @@ -145,11 +146,12 @@
  • Discuss helping CCAFS with some batch tagging of ORCID IDs for their authors
  • Discuss how the migration of CGIAR’s Active Directory to a flat structure will break our LDAP groups in DSpace
  • We had been using DC=ILRI to determine whether a user was ILRI or not
  • -
  • It looks like we might be able to use OUs now, instead of DCs:
  • - + +
  • It looks like we might be able to use OUs now, instead of DCs:

    $ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=org" -D "admigration1@cgiarad.org" -W "(sAMAccountName=admigration1)"
    -
    +
  • + Read more → @@ -175,13 +177,14 @@
  • Bootstrap is at 3.3.0 but upstream is at 3.3.7, and upgrading to anything beyond 3.3.1 breaks glyphicons and probably more
  • bower stuff is a dead end, waste of time, too many issues
  • Anything after Bootstrap 3.3.1 makes glyphicons disappear (HTTP 404 trying to access from incorrect path of fonts)
  • -
  • Start working on DSpace 5.1 → 5.5 port:
  • - + +
  • Start working on DSpace 5.1 → 5.5 port:

    $ git checkout -b 55new 5_x-prod
     $ git reset --hard ilri/5_x-prod
     $ git rebase -i dspace-5.5
    -
    +
  • + Read more → @@ -203,19 +206,18 @@ $ git rebase -i dspace-5.5 + +
  • I think this query should find and replace all authors that have “,” at the end of their names:

    dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
     UPDATE 95
     dspacetest=# select text_value from  metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
    - text_value
    +text_value
     ------------
     (0 rows)
    -
    +
  • - Read more → @@ -266,12 +268,13 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and + +
  • There are 3,000 IPs accessing the REST API in a 24-hour period!

    # awk '{print $1}' /var/log/nginx/rest.log  | uniq | wc -l
     3168
    -
    +
  • + Read more → diff --git a/docs/tags/notes/page/5/index.html b/docs/tags/notes/page/5/index.html index 64674ab8e..ee187e6dc 100644 --- a/docs/tags/notes/page/5/index.html +++ b/docs/tags/notes/page/5/index.html @@ -15,7 +15,7 @@ - + @@ -110,15 +110,15 @@

    2015-12-02

    +
  • Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less space:

    # cd /home/dspacetest.cgiar.org/log
     # ls -lh dspace.log.2015-11-18*
     -rw-rw-r-- 1 tomcat7 tomcat7 2.0M Nov 18 23:59 dspace.log.2015-11-18
     -rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo
     -rw-rw-r-- 1 tomcat7 tomcat7 169K Nov 18 23:59 dspace.log.2015-11-18.xz
    -
    +
  • + Read more → @@ -141,12 +141,13 @@ + +
  • Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:

    $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
     78
    -
    +
  • + Read more → diff --git a/docs/tags/page/2/index.html b/docs/tags/page/2/index.html index c908a545c..91e26b71d 100644 --- a/docs/tags/page/2/index.html +++ b/docs/tags/page/2/index.html @@ -15,7 +15,7 @@ - + @@ -101,18 +101,16 @@

    2018-07-01

    +
  • I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:

    $ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
    -
    +
  • - +
  • During the mvn package stage on the 5.8 branch I kept getting issues with java running out of memory:

    There is insufficient memory for the Java Runtime Environment to continue.
    -
    +
  • + Read more → @@ -139,23 +137,23 @@
  • There seems to be a problem with the CUA and L&R versions in pom.xml because they are using SNAPSHOT and it doesn’t build
  • I added the new CCAFS Phase II Project Tag PII-FP1_PACCA2 and merged it into the 5_x-prod branch (#379)
  • -
  • I proofed and tested the ILRI author corrections that Peter sent back to me this week:
  • - + +
  • I proofed and tested the ILRI author corrections that Peter sent back to me this week:

    $ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
    -
    +
  • - +
  • I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in March, 2018

  • + +
  • Time to index ~70,000 items on CGSpace:

    $ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b                                  
     
     real    74m42.646s
     user    8m5.056s
     sys     2m7.289s
    -
    +
  • + Read more → @@ -279,26 +277,25 @@ sys 2m7.289s
  • I didn’t get any load alerts from Linode and the REST and XMLUI logs don’t show anything out of the ordinary
  • The nginx logs show HTTP 200s until 02/Jan/2018:11:27:17 +0000 when Uptime Robot got an HTTP 500
  • In dspace.log around that time I see many errors like “Client closed the connection before file download was complete”
  • -
  • And just before that I see this:
  • - + +
  • And just before that I see this:

    Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
    -
    +
  • - +
  • Ah hah! So the pool was actually empty!

  • + +
  • I need to increase that, let’s try to bump it up from 50 to 75

  • + +
  • After that one client got an HTTP 499 but then the rest were HTTP 200, so I don’t know what the hell Uptime Robot saw

  • + +
  • I notice this error quite a few times in dspace.log:

    2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
     org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered " "]" "] "" at line 1, column 32.
    -
    +
  • - +
  • And there are many of these errors every day for the past month:

    $ grep -c "Error while searching for sidebar facets" dspace.log.*
     dspace.log.2017-11-21:4
    @@ -344,10 +341,9 @@ dspace.log.2017-12-30:89
     dspace.log.2017-12-31:53
     dspace.log.2018-01-01:45
     dspace.log.2018-01-02:34
    -
    +
  • - Read more → @@ -400,20 +396,18 @@ dspace.log.2018-01-02:34

    2017-11-02

    +
  • Today there have been no hits by CORE and no alerts from Linode (coincidence?)

    # grep -c "CORE" /var/log/nginx/access.log
     0
    -
    +
  • - +
  • Generate list of authors on CGSpace for Peter to go through and correct:

    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
     COPY 54701
    -
    +
  • + Read more → @@ -434,15 +428,14 @@ COPY 54701

    2017-10-01

    +
  • Peter emailed to point out that many items in the ILRI archive collection have multiple handles:

    http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
    -
    +
  • - Read more → diff --git a/docs/tags/page/3/index.html b/docs/tags/page/3/index.html index 72e253d5d..6f51f78fa 100644 --- a/docs/tags/page/3/index.html +++ b/docs/tags/page/3/index.html @@ -15,7 +15,7 @@ - + @@ -261,11 +261,12 @@ + +
  • Testing the CMYK patch on a collection with 650 items:

    $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt
    -
    +
  • + Read more → @@ -300,12 +301,13 @@
  • Also, need to consider talking to Atmire about hiring them to bring ORCiD metadata to REST / OAI
  • Filed an issue on DSpace issue tracker for the filter-media bug that causes it to process JPGs even when limiting to the PDF thumbnail plugin: DS-3516
  • Discovered that the ImageMagick filter-media plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK
  • -
  • Interestingly, it seems DSpace 4.x’s thumbnails were sRGB, but forcing regeneration using DSpace 5.x’s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 1056851999):
  • - + +
  • Interestingly, it seems DSpace 4.x’s thumbnails were sRGB, but forcing regeneration using DSpace 5.x’s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 1056851999):

    $ identify ~/Desktop/alc_contrastes_desafios.jpg
     /Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
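  • As a sketch of the manual workaround, assuming ImageMagick’s convert is available: force the thumbnail into the sRGB colorspace and confirm the result with identify (the output path is just an example):

    $ convert ~/Desktop/alc_contrastes_desafios.jpg -colorspace sRGB /tmp/alc_contrastes_desafios-srgb.jpg
    $ identify -format '%[colorspace]\n' /tmp/alc_contrastes_desafios-srgb.jpg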

    2017-02-07

  • An item was mapped twice erroneously again, so I had to remove one of the mappings manually:

    dspace=# select * from collection2item where item_id = '80278';
      id   | collection_id | item_id
    -------+---------------+---------
     92551 |           313 |   80278
     92550 |           313 |   80278
     90774 |          1051 |   80278
    (3 rows)
    dspace=# delete from collection2item where id = 92551 and item_id = 80278;
    DELETE 1
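  • A sketch of a query that would surface any other items with the same problem, based on the collection2item table shown above (untested; any item/collection pair mapped more than once is a duplicate):

    $ psql -d dspace -c "select item_id, collection_id, count(*) from collection2item group by item_id, collection_id having count(*) > 1;"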
  • While looking in the logs for errors, I see tons of warnings about Atmire MQM:

    2016-12-02 03:00:32,352 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail="dc.title", transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, ObjectType=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail="THUMBNAIL", transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, ObjectType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
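  • To get a sense of how noisy this is, a sketch in the same grep-and-count style used elsewhere in these notes (the log file name is just an example):

    $ grep -c 'BatchEditConsumer should not have been given this kind of Subject' dspace.log.2016-12-02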
  • ORCIDs only
  • ORCIDs plus normal authors
  • I exported a random item’s metadata as CSV, deleted all columns except id and collection, and made a new column called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:

    0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
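  • A sketch of what that test CSV and the import step might have looked like; the file name, id, and collection values are placeholders, and it is an assumption that the ORCID:dc.contributor.author column is understood when the CSV is applied with DSpace’s batch metadata editing tool:

    # hypothetical file shown only to illustrate the CSV layout
    $ cat /tmp/orcid-test.csv
    id,collection,ORCID:dc.contributor.author
    12345,10568/1234,0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
    $ [dspace]/bin/dspace metadata-import -f /tmp/orcid-test.csv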
  • Discuss helping CCAFS with some batch tagging of ORCID IDs for their authors
  • Discuss how the migration of CGIAR’s Active Directory to a flat structure will break our LDAP groups in DSpace
  • We had been using DC=ILRI to determine whether a user was ILRI or not
  • It looks like we might be able to use OUs now, instead of DCs:

    $ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=org" -D "admigration1@cgiarad.org" -W "(sAMAccountName=admigration1)"
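  • A sketch of checking which OU the account actually sits under by requesting only the distinguishedName attribute (same bind parameters as above; listing attributes after the filter is standard ldapsearch syntax):

    $ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=org" -D "admigration1@cgiarad.org" -W "(sAMAccountName=admigration1)" distinguishedName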
  • Bootstrap is at 3.3.0 but upstream is at 3.3.7, and upgrading to anything beyond 3.3.1 breaks glyphicons and probably more
  • bower stuff is a dead end, waste of time, too many issues
  • Anything after Bootstrap 3.3.1 makes glyphicons disappear (HTTP 404 when trying to load the fonts from an incorrect path)
  • Start working on DSpace 5.1 → 5.5 port:

    $ git checkout -b 55new 5_x-prod
     $ git reset --hard ilri/5_x-prod
     $ git rebase -i dspace-5.5
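  • A quick sanity build is the natural next step after a rebase like that; a sketch assuming the stock DSpace 5 Maven layout, with the Mirage 2 flag only relevant if that theme is enabled in this repository:

    $ mvn -U clean package -Dmirage2.on=true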
  • I think this query should find and replace all authors that have “,” at the end of their names:

    dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
     UPDATE 95
     dspacetest=# select text_value from  metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
      text_value
     ------------
     (0 rows)
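  • For the record, a sketch of how the candidates could be previewed before running such an update, reusing the same regular expression (read-only, so it is safe to run first):

    $ psql -d dspacetest -c "select text_value from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$' limit 10;"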
  • There are 3,000 IPs accessing the REST API in a 24-hour period!

    # awk '{print $1}' /var/log/nginx/rest.log  | uniq | wc -l
     3168
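  • One caveat worth noting: uniq only collapses adjacent duplicate lines, so a stricter unique-IP count would sort the addresses first; a sketch of the same pipeline with that change:

    $ awk '{print $1}' /var/log/nginx/rest.log | sort -u | wc -l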

    2015-12-02

  • Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less space:

    # cd /home/dspacetest.cgiar.org/log
     # ls -lh dspace.log.2015-11-18*
     -rw-rw-r-- 1 tomcat7 tomcat7 2.0M Nov 18 23:59 dspace.log.2015-11-18
     -rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo
     -rw-rw-r-- 1 tomcat7 tomcat7 169K Nov 18 23:59 dspace.log.2015-11-18.xz
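  • A sketch of what the nightly compression step itself might look like with xz; the path and date are taken from the listing above, while the actual cron job is assumed to differ:

    # xz removes the original file and leaves a .xz alongside the other logs
    $ xz /home/dspacetest.cgiar.org/log/dspace.log.2015-11-18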
  • Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:

    $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
     78