diff --git a/content/posts/2019-11.md b/content/posts/2019-11.md index b9c00da9f..0ef4fcdb5 100644 --- a/content/posts/2019-11.md +++ b/content/posts/2019-11.md @@ -444,7 +444,7 @@ Buck/2.2; (+https://app.hypefactors.com/media-monitoring/about.html) ## 2019-11-26 -- Visit CodeObie to discuss future of OpenRXV and AReS +- Visit CodeObia to discuss future of OpenRXV and AReS - I started working on categorizing and validating the feedback that Jane collated into a spreadsheet last week - I added GitHub issues for eight of the items so far, tagging them by "bug", "search", "feature", "graphics", "low-priority", etc - I moved AReS v2 to be available on CGSpace @@ -465,4 +465,12 @@ Buck/2.2; (+https://app.hypefactors.com/media-monitoring/about.html) - I need to ask Marie-Angelique about the `cg.peer-reviewed` field - We currently use `dc.description.version` with values like "Internal Review" and "Peer Review", and CG Core v2 currently recommends using "True" if the field is peer reviewed +## 2019-11-28 + +- File an issue with CG Core v2 project to ask Marie-Angelique about expanding the scope of `cg.peer-reviewed` to include other types of review, and possibly to change the field name to something more generic like `cg.review-status` ([#14](https://github.com/AgriculturalSemantics/cg-core/issues/14)) +- More review of AReS feedback + - I clarified some of the feedback + - I added status of "Issue Filed", "Duplicate" and "No Action Required" to several items + - I filed a handful more GitHub issues in AReS and OpenRXV GitHub trackers + diff --git a/docs/2015-11/index.html b/docs/2015-11/index.html index 5e2736a13..b00eeecd0 100644 --- a/docs/2015-11/index.html +++ b/docs/2015-11/index.html @@ -8,15 +8,12 @@ @@ -27,17 +24,14 @@ $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspac - + @@ -118,147 +112,107 @@ $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspac

2015-11-22

$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
 78

2015-12-05

postgres@linode01:~$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
 28
  • I have reverted all the pgtune tweaks from the other day, as they didn’t fix the stability issues, so I’d rather not have them introducing more variables into the equation

  • The PostgreSQL stats from Munin all point to something database-related with the DSpace 5 upgrade around mid–late November

    PostgreSQL bgwriter (year)
    PostgreSQL cache (year)
    PostgreSQL locks (year)
    PostgreSQL scans (year)

2015-12-07

    $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
     0.675
     $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
    @@ -267,14 +231,10 @@ $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle
     0.566
     $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
     0.497

2015-12-08

    diff --git a/docs/2016-01/index.html b/docs/2016-01/index.html index 646718eb6..534af441c 100644 --- a/docs/2016-01/index.html +++ b/docs/2016-01/index.html @@ -8,7 +8,6 @@ - + @@ -108,90 +106,72 @@ Update GitHub wiki for documentation of maintenance tasks.

2016-01-13

2016-01-14

2016-01-18

2016-01-19

2016-01-21

2016-01-25

2016-01-26

2016-01-28

2016-01-29

XMLUI subjects before

XMLUI subjects after

    diff --git a/docs/2016-02/index.html b/docs/2016-02/index.html index e266e58ad..d01748aae 100644 --- a/docs/2016-02/index.html +++ b/docs/2016-02/index.html @@ -8,15 +8,12 @@ @@ -29,19 +26,16 @@ Also, lots of things like “COTE D`LVOIRE” and “COTE D IVOIRE&r - + @@ -122,71 +116,53 @@ Also, lots of things like “COTE D`LVOIRE” and “COTE D IVOIRE&r

2016-02-05

CGSpace country list

2016-02-06

    dspacetest=# select * from metadatafieldregistry;
  • In this case our country field is 78

  • Now find all resources with type 2 (item) that have null/empty values for that field:

    dspacetest=# select resource_id from metadatavalue where resource_type_id=2 and metadata_field_id=78 and (text_value='' OR text_value IS NULL);

  • Then you can find the handle that owns it from its resource_id:

    dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '22678';

  • It’s 25 items so editing in the web UI is annoying, let’s try SQL!

    dspacetest=# delete from metadatavalue where metadata_field_id=78 and text_value='';
    DELETE 25

  • After that perhaps a regular dspace index-discovery (no -b) should suffice… (the two variants are sketched below)
  • Hmm, I indexed, cleared the Cocoon cache, and restarted Tomcat but the 25 “|||” countries are still there
  • Maybe I need to do a full re-index…
  • Yep! The full re-index seems to work.
  • Process the empty countries on CGSpace
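  • For reference, a minimal sketch of the two index-discovery variants being compared here; the binary path matches the local install used elsewhere in these notes, and only the -b flag differs:

    $ ~/dspace/bin/dspace index-discovery        # incremental update of the Discovery index
    $ ~/dspace/bin/dspace index-discovery -b     # full rebuild, which is what finally cleared the stale "|||" countries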

2016-02-07

    $ postgres -D /opt/brew/var/postgres
     $ createuser --superuser postgres
     $ createuser --pwprompt dspacetest
    @@ -200,10 +176,9 @@ postgres=# alter user dspacetest nocreateuser;
     postgres=# \q
     $ vacuumdb dspacetest
     $ psql -U dspacetest -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest -h localhost
  • After building and running a fresh_install I symlinked the webapps into Tomcat’s webapps folder:

    $ mv /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT.orig
     $ ln -sfv ~/dspace/webapps/xmlui /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT
     $ ln -sfv ~/dspace/webapps/rest /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/rest
    @@ -211,39 +186,28 @@ $ ln -sfv ~/dspace/webapps/jspui /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/
     $ ln -sfv ~/dspace/webapps/oai /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/oai
     $ ln -sfv ~/dspace/webapps/solr /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/solr
     $ /opt/brew/Cellar/tomcat/8.0.30/bin/catalina start
  • Add CATALINA_OPTS in /opt/brew/Cellar/tomcat/8.0.30/libexec/bin/setenv.sh, as this script is sourced by the catalina startup script
  • For example:

    CATALINA_OPTS="-Djava.awt.headless=true -Xms2048m -Xmx2048m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8"

  • After verifying that the site is working, start a full index:

    $ ~/dspace/bin/dspace index-discovery -b

2016-02-08

ILRI submission buttons
Drylands submission buttons

2016-02-09

    $ cd ~/src/git
     $ git clone https://github.com/letsencrypt/letsencrypt
     $ cd letsencrypt
    @@ -252,51 +216,39 @@ $ sudo service nginx stop
     $ ./letsencrypt-auto certonly --standalone -d dspacetest.cgiar.org
     $ sudo service nginx start
     $ ansible-playbook dspace.yml -l linode02 -t nginx,firewall -u aorth --ask-become-pass
  • We should install it in /opt/letsencrypt and then script the renewal, but first we have to wire up some variables and template stuff based on the script here: https://letsencrypt.org/howitworks/
  • I had to export some CIAT items that were being cleaned up on the test server and I noticed their dc.contributor.author fields have DSpace 5 authority index UUIDs…
  • To clean those up in OpenRefine I used this GREL expression: value.replace(/::\w{8}-\w{4}-\w{4}-\w{4}-\w{12}::600/,"")
  • Getting more and more hangs on DSpace Test, seemingly random but also during CSV import
  • Logs don’t always show anything right when it fails, but eventually one of these appears:

    org.dspace.discovery.SearchServiceException: Error while processing facet fields: java.lang.OutOfMemoryError: Java heap space

  • or

    Caused by: java.util.NoSuchElementException: Timeout waiting for idle object

  • Right now DSpace Test’s Tomcat heap is set to 1536m and we have quite a bit of free RAM:

    # free -m
                 total       used       free     shared    buffers     cached
    Mem:          3950       3902         48          9         37       1311
    -/+ buffers/cache:       2552       1397
    Swap:          255         57        198

  • So I’ll bump up the Tomcat heap to 2048 (CGSpace production server is using 3GB)

2016-02-11

    value.split('/')[-1]
  • Then I wrote a tool called generate-thumbnails.py to download the PDFs and generate thumbnails for them, for example:

    $ ./generate-thumbnails.py ciat-reports.csv
     Processing 64661.pdf
     > Downloading 64661.pdf
    @@ -304,138 +256,99 @@ Processing 64661.pdf
     Processing 64195.pdf
     > Downloading 64195.pdf
     > Creating thumbnail for 64195.pdf
2016-02-12

2016-02-12

    $ ls | grep -c -E "%"
     265
  • I suggest that we import ~850 or so of the clean ones first, then do the rest after I can find a clean/reliable way to decode the filenames
  • This python2 snippet seems to work in the CLI, but not so well in OpenRefine:

    $ python -c "import urllib, sys; print urllib.unquote(sys.argv[1])" CIAT_COLOMBIA_000169_T%C3%A9cnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf
    CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf

  • Merge pull requests for submission form theming (#178) and missing center subjects in XMLUI item views (#176)
  • They will be deployed on CGSpace the next time I re-deploy

2016-02-16

    value.unescape("url")

2016-02-17

2016-02-20

    java.io.FileNotFoundException: /usr/share/tomcat7/SimpleArchiveFormat/item_1021/CIAT_COLOMBIA_000075_Medición_de_palatabilidad_en_forrajes.pdf (No such file or directory)

2016-02-22

    value.replace('ó','o').replace('í','i').replace('á','a').replace('é','e').replace('ñ','n')
  • But actually, the accents might not be an issue, as I can successfully import files containing Spanish accents on my Mac
  • On closer inspection, I can import files with the following names on Linux (DSpace Test):

    Bitstream: tést.pdf
    Bitstream: tést señora.pdf
    Bitstream: tést señora alimentación.pdf

  • Seems it could be something with the HFS+ filesystem actually, as it’s not UTF-8 (it’s something like UCS-2)
  • HFS+ stores filenames as a string, and filenames with accents get stored as character+accent whereas Linux’s ext4 stores them as an array of bytes
  • Running the SAFBuilder on Mac OS X works if you’re going to import the resulting bundle on Mac OS X, but if your DSpace is running on Linux you need to run the SAFBuilder there where the filesystem’s encoding matches

2016-02-29

    value.replace("'",'').replace('_=_','_').replace(',','').replace('[','').replace(']','').replace('(','').replace(')','').replace('_.pdf','.pdf').replace('._','_')
  • Finally import the 1127 CIAT items into CGSpace: https://cgspace.cgiar.org/handle/10568/35710
  • Re-deploy CGSpace with the Google Scholar fix, but I’m waiting on the Atmire fixes for now, as the branch history is ugly
  • + diff --git a/docs/2016-03/index.html b/docs/2016-03/index.html index 0b6c19319..708809b51 100644 --- a/docs/2016-03/index.html +++ b/docs/2016-03/index.html @@ -8,9 +8,8 @@ @@ -22,12 +21,11 @@ Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Ja - + @@ -108,112 +106,86 @@ Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Ja

2016-03-02

2016-03-07

    Exception in thread "Lucene Merge Thread #19" org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: No space left on device

2016-03-08

2016-03-10

Mixed up label in Atmire CUA

2016-03-11

2016-03-14

Missing XMLUI string

2016-03-15

2016-03-16

    # select * from metadatavalue where metadata_field_id=37;
    + metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place | authority | confidence | resource_type_id
    +-------------------+-------------+-------------------+------------+-----------+-------+-----------+------------+------------------
    +           1942571 |       35342 |                37 | hi         |           |     1 |           |         -1 |                2
    +           1942468 |       35345 |                37 | hi         |           |     1 |           |         -1 |                2
    +           1942479 |       35337 |                37 | hi         |           |     1 |           |         -1 |                2
    +           1942505 |       35336 |                37 | hi         |           |     1 |           |         -1 |                2
    +           1942519 |       35338 |                37 | hi         |           |     1 |           |         -1 |                2
    +           1942535 |       35340 |                37 | hi         |           |     1 |           |         -1 |                2
    +           1942555 |       35341 |                37 | hi         |           |     1 |           |         -1 |                2
    +           1942588 |       35343 |                37 | hi         |           |     1 |           |         -1 |                2
    +           1942610 |       35346 |                37 | hi         |           |     1 |           |         -1 |                2
    +           1942624 |       35347 |                37 | hi         |           |     1 |           |         -1 |                2
    +           1942639 |       35339 |                37 | hi         |           |     1 |           |         -1 |                2

2016-03-17

2016-03-18

Excessive whitespace in thumbnail

Trimmed thumbnail

    $ gm convert -trim -quality 82 -thumbnail x300 -flatten Descriptor\ for\ Butia_EN-2015_2021.pdf\[0\] cover.jpg

2016-03-21

  • I will mark these errors as resolved because they are returning HTTP 403 on purpose, for a long time!
  • Google says the first time it saw this particular error was September 29, 2015… so maybe it accidentally saw it somehow…
  • On a related note, we have 51,000 items indexed from the sitemap, but 500,000 items in the Google index, so we DEFINITELY have a problem with duplicate content

CGSpace pages in Google index

URL parameters cause millions of dynamic pages
Setting pages with the filter_0 param not to show in search results

2016-03-22

2016-03-23

    Can't find method org.dspace.app.xmlui.aspect.administrative.FlowGroupUtils.processSaveGroup(org.dspace.core.Context,number,string,[Ljava.lang.String;,[Ljava.lang.String;,org.apache.cocoon.environment.wrapper.RequestWrapper). (resource://aspects/Administrative/administrative.js#967)

2016-03-24

2016-03-25

2016-03-28

2016-03-29

  • Test metadata migration on local instance again:

    $ ./migrate-fields.sh
     UPDATE metadatavalue SET metadata_field_id=109 WHERE metadata_field_id=66
     UPDATE 40885
    @@ -335,98 +287,80 @@ UPDATE 3872
     UPDATE metadatavalue SET metadata_field_id=217 WHERE metadata_field_id=108
     UPDATE 46075
     $ JAVA_OPTS="-Xms512m -Xmx512m -Dfile.encoding=UTF-8" ~/dspace/bin/dspace index-discovery -bf
  • CGSpace was down but I’m not sure why, this was in catalina.out:

    Apr 18, 2016 7:32:26 PM com.sun.jersey.spi.container.ContainerResponse logException
     SEVERE: Mapped exception to response: 500 (Internal Server Error)
     javax.ws.rs.WebApplicationException
    +        at org.dspace.rest.Resource.processFinally(Resource.java:163)
    +        at org.dspace.rest.HandleResource.getObject(HandleResource.java:81)
    +        at sun.reflect.GeneratedMethodAccessor198.invoke(Unknown Source)
    +        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    +        at java.lang.reflect.Method.invoke(Method.java:606)
    +        at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
    +        at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
    +        at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
    +        at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
    +        at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
    +        at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
    +        at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
    +        at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
    +        at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1511)
    +        at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1442)
    +        at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1391)
    +        at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1381)
    +        at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)
     ...
  • Everything else in the system looked normal (50GB disk space available, nothing weird in dmesg, etc)
  • After restarting Tomcat a few more of these errors were logged but the application was up

2016-04-19

    # select handle from item, handle where handle.resource_id = item.item_id AND item.item_id in (select resource_id from metadatavalue where resource_type_id=2 and metadata_field_id=105);
       handle
    -------------
     10568/10298
     10568/16413
     10568/16774
     10568/34487

  • Delete metadata values for dc.GRP and dc.icsubject.icrafsubject:

    # delete from metadatavalue where resource_type_id=2 and metadata_field_id=96;
     # delete from metadatavalue where resource_type_id=2 and metadata_field_id=83;
  • They are old ICRAF fields and we haven’t used them since 2011 or so
  • Also delete them from the metadata registry
  • CGSpace went down again, dspace.log had this:

    2016-04-19 15:02:17,025 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
    org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object

  • I restarted Tomcat and PostgreSQL and now it’s back up
  • I bet this is the same crash as yesterday, but I only saw the errors in catalina.out
  • Looks to be related to this, from dspace.log:

    2016-04-19 15:16:34,670 ERROR org.dspace.rest.Resource @ Something get wrong. Aborting context in finally statement.

  • We have 18,000 of these errors right now…
  • Delete a few more old metadata values: dc.Species.animal, dc.type.journal, and dc.publicationcategory:

    # delete from metadatavalue where resource_type_id=2 and metadata_field_id=105;
    # delete from metadatavalue where resource_type_id=2 and metadata_field_id=85;
    # delete from metadatavalue where resource_type_id=2 and metadata_field_id=95;

  • And then remove them from the metadata registry

2016-04-20

    $ ./migrate-fields.sh
     UPDATE metadatavalue SET metadata_field_id=109 WHERE metadata_field_id=66
     UPDATE 40909
    @@ -440,62 +374,47 @@ UPDATE metadatavalue SET metadata_field_id=215 WHERE metadata_field_id=106
     UPDATE 3872
     UPDATE metadatavalue SET metadata_field_id=217 WHERE metadata_field_id=108
     UPDATE 46075
  • Also, I migrated CGSpace to using the PGDG PostgreSQL repo as the infrastructure playbooks had been using it for a while and it seemed to be working well
  • Basically, this gives us the ability to use the latest upstream stable 9.3.x release (currently 9.3.12)
  • Looking into the REST API errors again, it looks like these started appearing a few days ago in the tens of thousands:

    $ grep -c "Aborting context in finally statement" dspace.log.2016-04-20
    21252

  • I found a recent discussion on the DSpace mailing list and I’ve asked for advice there
  • Looks like this issue was noted and fixed in DSpace 5.5 (we’re on 5.1): https://jira.duraspace.org/browse/DS-2936
  • I’ve sent a message to Atmire asking about compatibility with DSpace 5.5

2016-04-21

2016-04-22

2016-04-26

2016-04-27

    # grep -c "Aborting context in finally statement" dspace.log.2016-04-*
     dspace.log.2016-04-01:0
     dspace.log.2016-04-02:0
    @@ -524,40 +443,29 @@ dspace.log.2016-04-24:28775
     dspace.log.2016-04-25:28626
     dspace.log.2016-04-26:28655
     dspace.log.2016-04-27:7271
  • I restarted tomcat and it is back up
  • Add Spanish XMLUI strings so those users see “CGSpace” instead of “DSpace” in the user interface (#222)
  • Submit patch to upstream DSpace for the misleading help text in the embargo step of the item submission: https://jira.duraspace.org/browse/DS-3172
  • Update infrastructure playbooks for nginx 1.10.x (stable) release: https://github.com/ilri/rmg-ansible-public/issues/32
  • Currently running on DSpace Test, we’ll give it a few days before we adjust CGSpace
  • CGSpace down, restarted tomcat and it’s back up

2016-04-28

2016-04-30

    location /rest {
     	access_log /var/log/nginx/rest.log;
     	proxy_pass http://127.0.0.1:8443;
     }
  • I will check the logs again in a few days to look for patterns, see who is accessing it, etc

  • + diff --git a/docs/2016-05/index.html b/docs/2016-05/index.html index a0f1a900f..1f51b8e45 100644 --- a/docs/2016-05/index.html +++ b/docs/2016-05/index.html @@ -8,15 +8,12 @@ @@ -27,17 +24,14 @@ There are 3,000 IPs accessing the REST API in a 24-hour period! - + @@ -118,52 +112,38 @@ There are 3,000 IPs accessing the REST API in a 24-hour period!

2016-05-01

    # awk '{print $1}' /var/log/nginx/rest.log  | uniq | wc -l
     3168
2016-05-11

  • Start a test rebase of the 5_x-prod branch on top of the dspace-5.5 tag
  • There were a handful of conflicts that I didn't understand
  • After completing the rebase I tried to build with the module versions Atmire had indicated as being 5.5 ready but I got this error:

    [ERROR] Failed to execute goal on project additions: Could not resolve dependencies for project org.dspace.modules:additions:jar:5.5: Could not find artifact com.atmire:atmire-metadata-quality-api:jar:5.5-2.10.1-0 in sonatype-releases (https://oss.sonatype.org/content/repositories/releases/) -> [Help 1]

2016-05-12

  • Questions for CG people:
  • Found ~200 messed up CIAT values in dc.publisher:

    # select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=39 and text_value similar to "%  %";

2016-05-13

  • dc.place is our own field, so it's easy to move
  • I've removed dc.title.jtitle from the list for now because there's no use moving it out of DC until we know where it will go (see discussion yesterday)

2016-05-18

    if(cells['thumbnails'].value.contains('hqdefault'), cells['thumbnails'].value.split('/')[-2] + '.jpg', cells['thumbnails'].value.split('/')[-1])
  • Because ~400 records had the same filename on Flickr (hqdefault.jpg) but different UUIDs in the URL
  • So for the hqdefault.jpg ones I just take the UUID (-2) and use it as the filename
  • Before importing with SAFBuilder I tested adding “__bundle:THUMBNAIL” to the filename column and it works fine

2016-05-19

    value.replace('_','').replace('-','')
  • We need to hold off on moving dc.Species to cg.species because it is only used for plants, and might be better to move it to something like cg.species.plant
  • And dc.identifier.fund is MOSTLY used for CPWF project identifier but has some other sponsorship things

    # select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=75 and (text_value like 'PN%' or text_value like 'PHASE%' or text_value = 'CBA' or text_value = 'IA');

2016-05-20

    value + "__bundle:THUMBNAIL"
    value.replace(/\u0081/,'')

2016-05-23

2016-05-30

    $ mkdir ~/ccafs-images
     $ /home/dspacetest.cgiar.org/bin/dspace export -t COLLECTION -i 10568/79355 -d ~/ccafs-images -n 0 -m
  • And then import to CGSpace:

    $ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/70974 --source /tmp/ccafs-images --mapfile=/tmp/ccafs-images-may30.map &> /tmp/ccafs-images-may30.log

  • But now we have double authors for “CGIAR Research Program on Climate Change, Agriculture and Food Security” in the authority
  • I’m trying to do a Discovery index before messing with the authority index
  • Looks like we are missing the index-authority cron job, so who knows what’s up with our authority index
  • Run system updates on DSpace Test, re-deploy code, and reboot the server
  • Clean up and import ~200 CTA records to CGSpace via CSV like:

    $ export JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8"
     $ /home/cgspace.cgiar.org/bin/dspace metadata-import -e aorth@mjanja.ch -f ~/CTA-May30/CTA-42229.csv &> ~/CTA-May30/CTA-42229.log

  • Discovery indexing took a few hours for some reason, and after that I started the index-authority script

    $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" /home/cgspace.cgiar.org/bin/dspace index-authority

2016-05-31

    $ time /home/cgspace.cgiar.org/bin/dspace index-authority
     Retrieving all data
     Initialize org.dspace.authority.indexer.DSpaceAuthorityIndexer
    @@ -405,17 +336,12 @@ All done !
     real    37m26.538s
     user    2m24.627s
     sys     0m20.540s
  • Update tomcat7 crontab on CGSpace and DSpace Test to have the index-authority script that we were missing (a crontab sketch is below)
  • Add new ILRI subject and CCAFS project tags to input-forms.xml (#226, #225)
  • Manually mapped the authors of a few old CCAFS records to the new CCAFS authority UUID and re-indexed authority indexes to see if it helps correct those items.
  • Re-sync DSpace Test data with CGSpace
  • Clean up and import ~65 more CTA items into CGSpace

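  • For reference, a minimal sketch of the kind of tomcat7 crontab entry that was missing; the schedule and output redirection here are assumptions, while the JAVA_OPTS and index-authority command come from the notes above:

    # hypothetical nightly authority index for the tomcat7 user (schedule is an assumption)
    0 3 * * * JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" /home/cgspace.cgiar.org/bin/dspace index-authority > /dev/null 2>&1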
  • + diff --git a/docs/2016-06/index.html b/docs/2016-06/index.html index b641a05f1..22cdb51cb 100644 --- a/docs/2016-06/index.html +++ b/docs/2016-06/index.html @@ -8,9 +8,8 @@ - + @@ -114,300 +112,240 @@ Working on second phase of metadata migration, looks like this will work for mov

2016-06-01

    dspacetest=# update metadatavalue set metadata_field_id=130 where metadata_field_id=75 and (text_value like 'PN%' or text_value like 'PHASE%' or text_value = 'CBA' or text_value = 'IA');
     UPDATE 497
     dspacetest=# update metadatavalue set metadata_field_id=29 where metadata_field_id=75;
     UPDATE 14

2016-06-20

    # /opt/letsencrypt/letsencrypt-auto renew --standalone --pre-hook "/usr/bin/service nginx stop" --post-hook "/usr/bin/service nginx start"
  • I really need to fix that cron job… (a crontab sketch is below)
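  • A sketch of what that root cron job could look like, reusing the exact renew command above; the monthly schedule and log path are assumptions:

    # hypothetical root crontab entry; only the letsencrypt-auto command is taken from the note above
    0 2 1 * * /opt/letsencrypt/letsencrypt-auto renew --standalone --pre-hook "/usr/bin/service nginx stop" --post-hook "/usr/bin/service nginx start" >> /var/log/letsencrypt-renew.log 2>&1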

2016-06-24

    $ ./fix-metadata-values.py -i investors-not-blank-not-delete-85.csv -f dc.description.sponsorship -t 'correct investor' -m 29 -d cgspace -p 'fuuu' -u cgspace
     $ ./delete-metadata-values.py -i investors-delete-82.csv -f dc.description.sponsorship -m 29 -d cgspace -p 'fuuu' -u cgspace
  • The scripts for this are here:
  • Add new sponsors to controlled vocabulary (#244)
  • Refine submission form labels and hints

2016-06-28

    dspacetest=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=126 group by text_value order by count desc) to /tmp/contributors-june28.csv with csv;

2016-06-29

    72  55  #dc.source
     86  230 #cg.contributor.crp
     91  211 #cg.contributor.affiliation
    @@ -418,40 +356,31 @@ $ ./delete-metadata-values.py -i investors-delete-82.csv -f dc.description.spons
     74  220 #cg.identifier.doi
     79  222 #cg.identifier.googleurl
     89  223 #cg.identifier.dataurl
  • Run all cleanups and deletions of dc.contributor.corporate on CGSpace:

    $ ./fix-metadata-values.py -i Corporate-Authors-Fix-121.csv -f dc.contributor.corporate -t 'Correct style' -m 126 -d cgspace -u cgspace -p 'fuuu'
     $ ./fix-metadata-values.py -i Corporate-Authors-Fix-PB.csv -f dc.contributor.corporate -t 'should be' -m 126 -d cgspace -u cgspace -p 'fuuu'
     $ ./delete-metadata-values.py -f dc.contributor.corporate -i Corporate-Authors-Delete-13.csv -m 126 -u cgspace -d cgspace -p 'fuuu'

  • Re-deploy CGSpace and DSpace Test with latest June changes
  • Now the sharing and Altmetric bits are more prominent:

DSpace 5.1 XMLUI With Altmetric Badge

2016-06-30

    # select text_value from  metadatavalue where metadata_field_id=3 and text_value like '%,';

    # update metadatavalue set text_value = regexp_replace(text_value, '(Poole, J),', '\1') where metadata_field_id=3 and text_value = 'Poole, J,';
    diff --git a/docs/2016-07/index.html b/docs/2016-07/index.html index fdaacc057..990cafa19 100644 --- a/docs/2016-07/index.html +++ b/docs/2016-07/index.html @@ -8,19 +8,16 @@ @@ -32,22 +29,19 @@ In this case the select query was showing 95 results before the update - + @@ -128,67 +122,49 @@ In this case the select query was showing 95 results before the update

2016-07-01

    dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
     UPDATE 95
     dspacetest=# select text_value from  metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
 text_value
------------
(0 rows)

  • In this case the select query was showing 95 results before the update

2016-07-02

2016-07-04

2016-07-05

    $ ./fix-metadata-values.py -i /tmp/Species-Peter-Fix.csv -f dc.Species -t CORRECT -m 94 -d dspacetest -u dspacetest -p 'fuuu'

2016-07-06

    cgspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
     DELETE 23
  • Complete phase three of metadata migration, for the following fields:
  • Also, run fixes and deletes for species and author affiliations (over 1000 corrections!):

    $ ./fix-metadata-values.py -i Species-Peter-Fix.csv -f dc.Species -t CORRECT -m 212 -d dspace -u dspace -p 'fuuu'
     $ ./fix-metadata-values.py -i Affiliations-Fix-1045-Peter-Abenet.csv -f dc.contributor.affiliation -t Correct -m 211 -d dspace -u dspace -p 'fuuu'
     $ ./delete-metadata-values.py -f dc.contributor.affiliation -i Affiliations-Delete-Peter-Abenet.csv -m 211 -u dspace -d dspace -p 'fuuu'
  • I then ran all server updates and rebooted the server

2016-07-11

    $ ./fix-metadata-values.py -i /tmp/Authors-Fix-205-UTF8.csv -f dc.contributor.author -t correct -m 3 -d dspacetest -u dspacetest -p fuuu
     $ ./delete-metadata-values.py -f dc.contributor.author -i /tmp/Authors-Delete-UTF8.csv -m 3 -u dspacetest -d dspacetest -p fuuu

2016-07-13

2016-07-14

2016-07-18

    2016-07-18 20:26:30,941 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error - 
     org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
     ...
  • I suspect it’s someone hitting REST too much:

    # awk '{print $1}' /var/log/nginx/rest.log  | sort -n | uniq -c | sort -h | tail -n 3
        710 66.249.78.38
       1781 181.118.144.29
      24904 70.32.99.142

  • I just blocked access to /rest for that last IP for now:

         # log rest requests
         location /rest {
             access_log /var/log/nginx/rest.log;
             proxy_pass http://127.0.0.1:8443;
             deny 70.32.99.142;
         }

2016-07-21


2016-07-22

    index.authority.ignore-prefered.dc.contributor.author=true
     index.authority.ignore-variants.dc.contributor.author=false
  • After reindexing I don’t see any change in Discovery’s display of authors, and still have entries like:

    Grace, D. (464)
    Grace, D. (62)

  • I asked for clarification of the following options on the DSpace mailing list:

    index.authority.ignore
    index.authority.ignore-prefered
    index.authority.ignore-variants

  • In the mean time, I will try these on DSpace Test (plus a reindex):

    index.authority.ignore=true
    index.authority.ignore-prefered=true
    index.authority.ignore-variants=true

  • Enabled usage of X-Forwarded-For in DSpace admin control panel (#255)
  • It was misconfigured and disabled, but already working for some reason sigh
  • … no luck. Trying with just:

    index.authority.ignore=true

  • After re-indexing and clearing the XMLUI cache nothing has changed

2016-07-25

    index.authority.ignore-prefered.dc.contributor.author=true
     index.authority.ignore-variants=true
  • Run all OS updates and reboot DSpace Test server
  • No changes to Discovery after reindexing… hmm.
  • Integrate and massively clean up About page (#256)

About page

    discovery.index.authority.ignore-prefered.dc.contributor.author=true
     discovery.index.authority.ignore-variants=true
  • Still no change!
  • Deploy species, breed, and identifier changes to CGSpace, as well as About page
  • Run Linode RAM upgrade (8→12GB)
  • Re-sync DSpace Test with CGSpace
  • I noticed that our backup scripts don’t send Solr cores to S3 so I amended the script

2016-07-31


2016-08-09

2016-08-10

2016-08-11

DSpace 5.5 on Ubuntu 16.04, Tomcat 7, Java 8, PostgreSQL 9.5

2016-08-14

2016-08-15

ExpressJS running behind nginx

2016-08-16

2016-08-17

2016-08-18

    dspace=# update metadatavalue set text_value='CONGO, DR' where resource_type_id=2 and metadata_field_id=228 and text_value='CONGO,DR';

2016-08-21

2016-08-22

    $ ~/dspace/bin/dspace database info
     
     Database URL: jdbc:postgresql://localhost:5432/dspacetest
    @@ -338,106 +283,80 @@ Database Driver: PostgreSQL Native Driver version PostgreSQL 9.1 JDBC4 (build 90
     | 5.1.2015.12.03 | Atmire CUA 4 migration     | 2016-03-21 17:10:41 | Success |
     | 5.1.2015.12.03 | Atmire MQM migration       | 2016-03-21 17:10:42 | Success |
     +----------------+----------------------------+---------------------+---------+
  • So I’m not sure why they have problems when we move to DSpace 5.5 (even the 5.1 migrations themselves show as “Missing”)

2016-08-23

    dspacetest=# delete from schema_version where description =  'Atmire CUA 4 migration' and version='5.1.2015.12.03.2';
     dspacetest=# delete from schema_version where description =  'Atmire MQM migration' and version='5.1.2015.12.03.3';
  • After that DSpace starts up but XMLUI now has unrelated issues that I need to solve!

    org.apache.avalon.framework.configuration.ConfigurationException: Type 'ThemeResourceReader' does not exist for 'map:read' at jndi:/localhost/themes/0_CGIAR/sitemap.xmap:136:77
     context:/jndi:/localhost/themes/0_CGIAR/sitemap.xmap - 136:77

  • Looks like we’re missing some stuff in the XMLUI module’s sitemap.xmap, as well as in each of our XMLUI themes
  • Diff them with these to get the ThemeResourceReader changes:
  • Then we had some NullPointerException from the SolrLogger class, which is apparently part of Atmire's CUA module
  • I tried with a small version bump to CUA but it didn't work (version 5.5-4.1.1-0)
  • Also, I started looking into huge pages to prepare for PostgreSQL 9.5, but it seems Linode's kernels don't enable them

2016-08-24

    dspacetest=# select distinct text_value from metadatavalue where metadata_field_id=55 and text_value !~ '.*(\.pdf|\.png|\.PDF|\.Pdf|\.JPEG|\.jpg|\.JPG|\.jpeg|\.xls|\.rtf|\.docx?|\.potx|\.dotx|\.eqa|\.tiff|\.mp4|\.mp3|\.gif|\.zip|\.txt|\.pptx|\.indd|\.PNG|\.bmp|\.exe|org\.dspace\.app\.mediafilter).*';

2016-08-25

    ...
     Error creating bean with name 'MetadataStorageInfoService'
     ...
  • Atmire sent an updated version of dspace/config/spring/api/atmire-cua.xml and now XMLUI starts but gives a null pointer exception:

    Java stacktrace: java.lang.NullPointerException
    +    at org.dspace.app.xmlui.aspect.statistics.Navigation.addOptions(Navigation.java:129)
    +    at org.dspace.app.xmlui.wing.AbstractWingTransformer.startElement(AbstractWingTransformer.java:228)
    +    at sun.reflect.GeneratedMethodAccessor126.invoke(Unknown Source)
    +    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    +    at java.lang.reflect.Method.invoke(Method.java:606)
    +    at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
    +    at com.sun.proxy.$Proxy103.startElement(Unknown Source)
    +    at org.apache.cocoon.environment.internal.EnvironmentChanger.startElement(EnvironmentStack.java:140)
    +    at org.apache.cocoon.environment.internal.EnvironmentChanger.startElement(EnvironmentStack.java:140)
    +    at org.apache.cocoon.xml.AbstractXMLPipe.startElement(AbstractXMLPipe.java:94)
     ...
  • Import the 47 CCAFS records to CGSpace, creating the SimpleArchiveFormat bundles and importing like:

    $ ./safbuilder.sh -c /tmp/Thumbnails\ to\ Upload\ to\ CGSpace/3546.csv
     $ JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m" /home/cgspace.cgiar.org/bin/dspace import -a -e aorth@mjanja.ch -c 10568/3546 -s /tmp/Thumbnails\ to\ Upload\ to\ CGSpace/SimpleArchiveFormat -m 3546.map

  • Finally got DSpace 5.5 working with the Atmire modules after a few rounds of back and forth with Atmire devs

2016-08-26

    2016-08-26 20:48:05,040 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -                                                               org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object

2016-08-27

    value.replace("'","").replace(",","").replace('"','')
  • I need to write a Python script to match that for renaming files in the file system (a rough shell sketch of the idea is below)
  • When importing SAF bundles it seems you can specify the target collection on the command line using -c 10568/4003 or in the collections file inside each item in the bundle
  • Seems that the latter method causes a null pointer exception, so I will just have to use the former method
  • In the end I was able to import the files after unzipping them ONLY on Linux
  • Import CIAT Gender Network records to CGSpace, first creating the SAF bundles as my user, then importing as the tomcat7 user, and deleting the bundle, for each collection's items:

    $ ./safbuilder.sh -c /home/aorth/ciat-gender-2016-09-06/66601.csv
     $ JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m" /home/cgspace.cgiar.org/bin/dspace import -a -e aorth@mjanja.ch -c 10568/66601 -s /home/aorth/ciat-gender-2016-09-06/SimpleArchiveFormat -m 66601.map
     $ rm -rf ~/ciat-gender-2016-09-06/SimpleArchiveFormat/
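  • I haven’t written that rename script yet, but a rough shell sketch of the same cleanup the GREL above does (stripping single quotes, commas, and double quotes from filenames) could look like this; purely illustrative:

    for f in *.pdf; do
        new=${f//\'/}      # drop single quotes
        new=${new//,/}     # drop commas
        new=${new//\"/}    # drop double quotes
        [ "$f" != "$new" ] && mv -v -- "$f" "$new"
    done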

2016-09-07

    2016-09-07 11:39:23,162 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
     org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
     ...
  • Since CGSpace had crashed I quickly deployed the new LDAP settings before restarting Tomcat

2016-09-13

2016-09-14

2016-09-29

    dspacetest=# select distinct text_value from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/5472', '10568/5473')));

2016-09-30

    diff --git a/docs/2016-10/index.html b/docs/2016-10/index.html index 63fe6e3f2..69b58b0f0 100644 --- a/docs/2016-10/index.html +++ b/docs/2016-10/index.html @@ -8,19 +8,16 @@ @@ -31,21 +28,18 @@ I exported a random item’s metadata as CSV, deleted all columns except id - + @@ -126,196 +120,144 @@ I exported a random item’s metadata as CSV, deleted all columns except id

2016-10-03

  • I exported a random item's metadata as CSV, deleted all columns except id and collection, and made a new column called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry (a CSV sketch is below):

    0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
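  • A sketch of what that test CSV looks like; the item id and collection handle below are hypothetical placeholders, only the column name and the ORCID values come from the note above:

    id,collection,ORCID:dc.contributor.author
    10568/12345,10568/67890,0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X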
    $ ./fix-metadata-values.py -i authors-fix-3180.csv -f dc.contributor.author -t correct -m 3 -d dspacetest -u dspacetest -p fuuu
     $ ./delete-metadata-values.py -i authors-delete-3.csv -f dc.contributor.author -m 3 -d dspacetest -u dspacetest -p fuuu
  • Remove old about page (#284)
  • CGSpace crashed a few times today
  • Generate list of unique authors in CCAFS collections:

    dspacetest=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/32729', '10568/5472', '10568/5473', '10568/10288', '10568/70974', '10568/3547', '10568/3549', '10568/3531','10568/16890','10568/5470','10568/3546', '10568/36024', '10568/66581', '10568/21789', '10568/5469', '10568/5468', '10568/3548', '10568/71053', '10568/25167'))) group by text_value order by count desc) to /tmp/ccafs-authors.csv with csv;

2016-10-05

2016-10-06

CMYK vs sRGB colors

2016-10-08

    dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
     DELETE 11
  • Run all system updates and reboot CGSpace
  • Delete ten gigs of old 2015 Tomcat logs that never got rotated (WTF?); a find sketch for removing them is below:

    root@linode01:~# ls -lh /var/log/tomcat7/localhost_access_log.2015* | wc -l
    47

  • Delete 2GB cron-filter-media.log file, as it is just a log from a cron job and it doesn’t get rotated like normal log files (almost a year now maybe)
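  • A sketch of how those could actually be removed, assuming the same path and filename pattern as the ls above; illustrative only:

    # list what would be deleted first
    root@linode01:~# find /var/log/tomcat7 -name 'localhost_access_log.2015*' -print
    # then actually delete them
    root@linode01:~# find /var/log/tomcat7 -name 'localhost_access_log.2015*' -delete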

2016-10-14

2016-10-17

    $ ./fix-metadata-values.py -i ccafs-authors-oct-16.csv -f dc.contributor.author -t 'correct name' -m 3 -d dspace -u dspace -p fuuu

2016-10-18

    $ git checkout -b 5_x-55 5_x-prod
     $ git rebase -i dspace-5.5
  • Have to fix about ten merge conflicts, mostly in the SCSS for the CGIAR theme
  • Skip 1e34751b8cf17021f45d4cf2b9a5800c93fb4cb2 in lieu of upstream’s 55e623d1c2b8b7b1fa45db6728e172e06bfa8598 (fixes X-Forwarded-For header) because I had made the same fix myself and it’s better to use the upstream one
  • I notice this rebase gets rid of GitHub merge commits… which actually might be fine because merges are fucking annoying to deal with when remote people merge without pulling and rebasing their branch first
  • Finished up applying the 5.5 sitemap changes to all themes
  • Merge the discovery.xml cleanups (#278)
  • Merge some minor edits to the distribution license (#285)

2016-10-19

2016-10-20

2016-10-25

    $ /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/27629 --child=10568/25101
  • Start testing some things for DSpace 5.5, like command line metadata import, PDF media filter, and Atmire CUA
  • Start looking at batch fixing of “old” ILRI website links without www or https, for example:

    dspace=# select * from metadatavalue where resource_type_id=2 and text_value like 'http://ilri.org%';

  • Also CCAFS has HTTPS and their links should use it where possible:

    dspace=# select * from metadatavalue where resource_type_id=2 and text_value like 'http://ccafs.cgiar.org%';

  • And this will find community and collection HTML text that is using the old style PNG/JPG icons for RSS and email (we should be using Font Awesome icons instead):

    dspace=# select text_value from metadatavalue where resource_type_id in (3,4) and text_value like '%Iconrss2.png%';

  • Turns out there are shit tons of varieties of this, like with http, https, www, separate </img> tags, alignments, etc
  • Had to find all variations and replace them individually:

    dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://www.ilri.org/images/Iconrss2.png"/>','<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://www.ilri.org/images/Iconrss2.png"/>%';
     dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://www.ilri.org/images/email.jpg"/>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://www.ilri.org/images/email.jpg"/>%';
     dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="http://www.ilri.org/images/Iconrss2.png"/>', '<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="http://www.ilri.org/images/Iconrss2.png"/>%';
    @@ -332,20 +274,15 @@ dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<i
     dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img valign="center" align="left" src="https://www.ilri.org/images/email.jpg"/>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img valign="center" align="left" src="https://www.ilri.org/images/email.jpg"/>%';
     dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img valign="center" align="left" src="http://www.ilri.org/images/Iconrss2.png"/>', '<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img valign="center" align="left" src="http://www.ilri.org/images/Iconrss2.png"/>%';
     dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img valign="center" align="left" src="http://www.ilri.org/images/email.jpg"/>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img valign="center" align="left" src="http://www.ilri.org/images/email.jpg"/>%';

  • Getting rid of these reduces the number of network requests each client makes on community/collection pages, and makes use of Font Awesome icons (which they are already loading anyway!)

  • And now that I start looking, I want to fix a bunch of links to popular sites that should be using HTTPS, like Twitter, Facebook, Google, Feed Burner, DOI, etc

  • I should look to see if any of those domains is sending an HTTP 301 or setting HSTS headers to their HTTPS domains, then just replace them (see the quick check sketched below)
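
  • For example (a sketch, not a command from the original notes), the plain HTTP version of each domain can be checked for a redirect or HSTS header with curl:

    $ curl -sI http://twitter.com/ | grep -iE '^(HTTP|Location|Strict-Transport-Security)'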

    2016-10-27

    dspace=# \i /tmp/font-awesome-text-replace.sql
     UPDATE 17
     UPDATE 17
     UPDATE 1
     UPDATE 1
     UPDATE 0

  • Looks much better now:

    CGSpace with old icons
    DSpace Test with Font Awesome icons

    2016-10-30

    dspace=# update metadatavalue set authority='799da1d8-22f3-43f5-8233-3d2ef5ebf8a8', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Charleston, B.%';
     UPDATE 10
     dspace=# update metadatavalue set authority='e936f5c5-343d-4c46-aa91-7a1fff6277ed', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Knight-Jones%';
     UPDATE 36

  • I updated the authority index but nothing seemed to change, so I’ll wait and do it again after I update Discovery below

  • Skype chat with Tsega about the IFPRI contentdm bridge

  • We tested harvesting OAI in an example collection to see how it works

  • Talk to Carlos Quiros about CG Core metadata in CGSpace

  • Get a list of countries from CGSpace so I can do some batch corrections:

    dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id=228 group by text_value order by count desc) to /tmp/countries.csv with csv;

  • Fix a bunch of countries in OpenRefine and run the corrections on CGSpace:

    $ ./fix-metadata-values.py -i countries-fix-18.csv -f dc.coverage.country -t 'correct' -m 228 -d dspace -u dspace -p fuuu
     $ ./delete-metadata-values.py -i countries-delete-2.csv -f dc.coverage.country -m 228 -d dspace -u dspace -p fuuu

  • Run a shit ton of author fixes from Peter Ballantyne that we’ve been cleaning up for two months:

    $ ./fix-metadata-values.py -i /tmp/authors-fix-pb2.csv -f dc.contributor.author -t correct -m 3 -u dspace -d dspace -p fuuu

  • Run a few URL corrections for ilri.org and doi.org, etc:

    dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://www.ilri.org','https://www.ilri.org') where resource_type_id=2 and text_value like '%http://www.ilri.org%';
     dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://mahider.ilri.org', 'https://cgspace.cgiar.org') where resource_type_id=2 and text_value like '%http://mahider.%.org%' and metadata_field_id not in (28);
     dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://dx.doi.org', 'https://dx.doi.org') where resource_type_id=2 and text_value like '%http://dx.doi.org%' and metadata_field_id not in (18,26,28,111);
     dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://doi.org', 'https://dx.doi.org') where resource_type_id=2 and text_value like '%http://doi.org%' and metadata_field_id not in (18,26,28,111);

  • I skipped metadata fields like citation and description

    2016-11-01

    Listings and Reports with output type

    2016-11-02

    2016-11-02 15:09:48,578 INFO  com.atmire.dspace.discovery.AtmireSolrService @ Wrote Collection: 10568/76454 to Index
     2016-11-02 15:09:48,584 INFO  com.atmire.dspace.discovery.AtmireSolrService @ Wrote Community: 10568/3202 to Index
     2016-11-02 15:09:48,589 INFO  com.atmire.dspace.discovery.AtmireSolrService @ Wrote Collection: 10568/76455 to Index
     2016-11-02 15:09:48,616 INFO  com.atmire.dspace.discovery.AtmireSolrService @ Wrote Collection: 10568/76457 to Index
     2016-11-02 15:09:48,634 ERROR com.atmire.dspace.discovery.AtmireSolrService @
     java.lang.NullPointerException
        at org.dspace.discovery.SearchUtils.getDiscoveryConfiguration(SourceFile:57)
        at org.dspace.discovery.SolrServiceImpl.buildDocument(SolrServiceImpl.java:824)
        at com.atmire.dspace.discovery.AtmireSolrService.indexContent(AtmireSolrService.java:821)
        at com.atmire.dspace.discovery.AtmireSolrService.updateIndex(AtmireSolrService.java:898)
        at org.dspace.discovery.SolrServiceImpl.createIndex(SolrServiceImpl.java:370)
        at org.dspace.storage.rdbms.DatabaseUtils$ReindexerThread.run(DatabaseUtils.java:945)

  • DSpace is still up, and a few minutes later I see the default DSpace indexer is still running

  • Sure enough, looking back before the first one finished, I see output from both indexers interleaved in the log:

    2016-11-02 15:09:28,545 INFO  org.dspace.discovery.SolrServiceImpl @ Wrote Item: 10568/47242 to Index
     2016-11-02 15:09:28,633 INFO  org.dspace.discovery.SolrServiceImpl @ Wrote Item: 10568/60785 to Index
     2016-11-02 15:09:28,678 INFO  com.atmire.dspace.discovery.AtmireSolrService @ Processing (55695 of 55722): 43557
     2016-11-02 15:09:28,688 INFO  com.atmire.dspace.discovery.AtmireSolrService @ Processing (55703 of 55722): 34476

  • I will raise a ticket with Atmire to ask them

    2016-11-06

    2016-11-07

    $ grep -A 3 contact_info * | grep -E "(Orth|Sisay|Peter|Daniel|Tsega)" | awk -F'-' '{print $1}' | grep linode | uniq | xargs grep linode_id

  • I noticed some weird CRPs in the database, and they don’t show up in Discovery for some reason

  • I’ll export these and fix them in batch:

    dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id=230 group by text_value order by count desc) to /tmp/crp.csv with csv;
     COPY 22

  • Test running the replacements:

    $ ./fix-metadata-values.py -i /tmp/CRPs.csv -f cg.contributor.crp -t correct -m 230 -d dspace -u dspace -p 'fuuu'

  • Add AMR to ILRI subjects and remove one duplicate instance of IITA in author affiliations controlled vocabulary (#288)

    2016-11-08

    Listings and Reports broken in DSpace 5.5

    2016-11-09

    dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id=3 group by text_value order by count desc limit 210) to /tmp/210-authors.csv with csv;

    2016-11-10

    $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS"}'

  • But the results are deceiving because metadata fields can have text languages and your query must match exactly!

    dspace=# select distinct text_value, text_lang from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS';
     text_value | text_lang
    ------------+-----------
     SEEDS      |
     SEEDS      |
     SEEDS      | en_US
     (3 rows)

  • So basically, the text language here could be null, blank, or en_US

  • To query metadata with these properties, you can do:

    $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS"}' | jq length
     55
     $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS", "language":""}' | jq length
     34
     $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS", "language":"en_US"}' | jq length

  • The results (55+34=89) don’t seem to match those from the database:

    dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS' and text_lang is null;
     count
    -------
        15
    dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS' and text_lang='';
     count
    -------
         4
    dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS' and text_lang='en_US';
     count
    -------
        66

  • So, querying from the API I get 55 + 34 = 89 results, but the database actually only has 85…

  • And the find-by-metadata-field endpoint doesn’t seem to have a way to get all items with the field, or a wildcard value

  • I’ll ask a question on the dspace-tech mailing list

  • And speaking of text_lang, this is interesting:

    dspacetest=# select distinct text_lang from metadatavalue where resource_type_id=2;
     text_lang
    -----------
    
     ethnob
     en
     spa
     EN
     es
     frn
     en_
     en_US
    
     EN_US
     eng
     en_U
     fr
     (14 rows)

  • Generate a list of all these so I can maybe fix them in batch:

    dspace=# \copy (select distinct text_lang, count(*) from metadatavalue where resource_type_id=2 group by text_lang order by count desc) to /tmp/text-langs.csv with csv;
     COPY 14

  • Perhaps we need to fix them all in batch, or experiment with fixing only certain metadatavalues:

    dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS';
     UPDATE 85

  • The fix-metadata-values.py script I have is meant for specific metadata values, so if I want to update some text_lang values I should just do it directly in the database

  • For example, on a limited set:

    dspace=# update metadatavalue set text_lang=NULL where resource_type_id=2 and metadata_field_id=203 and text_value='LIVESTOCK' and text_lang='';
     UPDATE 420

  • And assuming I want to do it for all fields:

    dspacetest=# update metadatavalue set text_lang=NULL where resource_type_id=2 and text_lang='';
     UPDATE 183726

  • After that I restarted Tomcat and PostgreSQL (because I’m superstitious about caches) and now I see the following in the REST API query:

    $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS"}' | jq length
     71
     $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS", "language":""}' | jq length
     0
     $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS", "language":"en_US"}' | jq length

  • Not sure what’s going on, but Discovery shows 83 values, and database shows 85, so I’m going to reindex Discovery just in case

    2016-11-14

    $ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
     HTTP/1.1 200 OK
     Connection: keep-alive
     Transfer-Encoding: chunked
     Vary: Accept-Encoding
     X-Cocoon-Version: 2.2.0

  • The first one gets a session, and any after that — within 60 seconds — will be internally mapped to the same session by Tomcat

  • This means that when Google or Baidu slam you with tens of concurrent connections they will all map to ONE internal session, which saves RAM!

    2016-11-15

    Tomcat JVM heap (day) after setting up the Crawler Session Manager
    Tomcat JVM heap (week) after setting up the Crawler Session Manager

    $ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
     HTTP/1.1 200 OK
     Connection: keep-alive
     Transfer-Encoding: chunked
     Vary: Accept-Encoding
     X-Cocoon-Version: 2.2.0

  • Adding Baiduspider to the list of user agents seems to work, and the final configuration should be:

    <!-- Crawler Session Manager Valve helps mitigate damage done by web crawlers -->
     <Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
           crawlerUserAgents=".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*Baiduspider.*" />

  • Looking at the bots that were active yesterday it seems the above regex should be sufficient:

    $ grep -o -E 'Mozilla/5\.0 \(compatible;.*\"' /var/log/nginx/access.log | sort | uniq
     Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)" "-"
     Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"
     Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
     Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)" "-"
     Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)" "-"

  • Neat Maven trick to exclude some modules from being built:

    $ mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false -Denv=localhost -P \!dspace-lni,\!dspace-rdf,\!dspace-sword,\!dspace-swordv2 clean package

  • We absolutely don’t use those modules, so we shouldn’t build them in the first place

    2016-11-17

    dspace=# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=55 group by text_value order by count desc) to /tmp/journal-titles.csv with csv;
     COPY 2515

  • Send a message to users of the CGSpace REST API to notify them of the upcoming upgrade so they can test their apps against DSpace Test

  • Test an update of old, non-HTTPS links to the CCAFS website in CGSpace metadata:

    dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, 'http://ccafs.cgiar.org','https://ccafs.cgiar.org') where resource_type_id=2 and text_value like '%http://ccafs.cgiar.org%';
     UPDATE 164
     dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://ccafs.cgiar.org','https://ccafs.cgiar.org') where resource_type_id=2 and text_value like '%http://ccafs.cgiar.org%';
     UPDATE 7

  • Had to run it twice to get them all; it turns out PostgreSQL’s regexp_replace() only replaces the first match unless you pass the 'g' flag (see the example below)

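  • For the record, an example of the same replacement with the 'g' flag (a sketch I did not run at the time):

    dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://ccafs.cgiar.org', 'https://ccafs.cgiar.org', 'g') where resource_type_id=2 and text_value like '%http://ccafs.cgiar.org%';
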
  • Run the updates on CGSpace as well

  • Run through some collections and manually regenerate some PDF thumbnails for items from before 2016 on DSpace Test to compare with CGSpace

  • I’m debating forcing the re-generation of ALL thumbnails, since some come from DSpace 3 and 4 when the thumbnailing wasn’t as good

  • The results were very good; I think that after we upgrade to 5.5 I will do it, perhaps one community / collection at a time:

    $ [dspace]/bin/dspace filter-media -f -i 10568/67156 -p "ImageMagick PDF Thumbnail"

  • In related news, I’m looking at thumbnails of thumbnails (the ones we uploaded manually before, and now DSpace’s media filter has made thumbnails of THEM):

    dspace=# select text_value from metadatavalue where text_value like '%.jpg.jpg';

  • I’m not sure if there’s anything we can do, actually, because we would have to remove those from the thumbnail bundles, and replace them with the regular JPGs from the content bundle, and then remove them from the assetstore…

    2016-11-18

    2016-11-21

    2016-11-23

    2016-11-24

    2016-11-27

  • Need to do updates for ansible infrastructure role defaults, and switch the GitHub branch to the new 5.5 one
  • Testing DSpace 5.5 on CGSpace, it seems CUA’s export as XLS works for Usage statistics, but not Content statistics
  • I will raise a bug with Atmire

    2016-11-28

    INFO: FrameworkServlet 'oai': initialization completed in 2600 ms
     [>                                                  ] 0% time remaining: Calculating... timestamp: 2016-11-28 03:00:18
     [>                                                  ] 0% time remaining: 11 hour(s) 57 minute(s) 46 seconds. timestamp: 2016-11-28 03:00:19
     [>                                                  ] 0% time remaining: 14 hour(s) 5 minute(s) 56 seconds. timestamp: 2016-11-28 03:00:19
     [>                                                  ] 0% time remaining: 11 hour(s) 23 minute(s) 49 seconds. timestamp: 2016-11-28 03:00:19
     [>                                                  ] 0% time remaining: 11 hour(s) 21 minute(s) 57 seconds. timestamp: 2016-11-28 03:00:20

  • It says OAI, and seems to start at 3:00 AM, but I only see the filter-media cron job set to start then

  • Double checking the DSpace 5.x upgrade notes for anything I missed, or troubleshooting tips

  • Running some manual processes just in case:

    $ /home/dspacetest.cgiar.org/bin/dspace registry-loader -metadata /home/dspacetest.cgiar.org/config/registries/dcterms-types.xml
     $ /home/dspacetest.cgiar.org/bin/dspace registry-loader -metadata /home/dspacetest.cgiar.org/config/registries/dublin-core-types.xml
     $ /home/dspacetest.cgiar.org/bin/dspace registry-loader -metadata /home/dspacetest.cgiar.org/config/registries/eperson-types.xml
     $ /home/dspacetest.cgiar.org/bin/dspace registry-loader -metadata /home/dspacetest.cgiar.org/config/registries/workflow-types.xml

  • Start working on paper for KM4Dev journal

  • Wow, Bram from Atmire pointed out this solution for using multiple handles with one DSpace instance: https://wiki.duraspace.org/display/DSDOC5x/Installing+DSpace?focusedCommentId=78163296#comment-78163296

  • We might be able to migrate the CGIAR Library now, as they had wanted to keep their handles

    2016-11-29

    2016-11-29 07:56:36,350 INFO  org.dspace.authenticate.LDAPAuthentication @ g.cherinet@cgiar.org:session_id=F628E13AB4EF2BA949198A99EFD8EBE4:ip_addr=213.55.99.121:failed_login:no DN found for user g.cherinet@cgiar.org
     2016-11-29 07:56:36,350 INFO  org.dspace.authenticate.PasswordAuthentication @ g.cherinet@cgiar.org:session_id=F628E13AB4EF2BA949198A99EFD8EBE4:ip_addr=213.55.99.121:authenticate:attempting password auth of user=g.cherinet@cgiar.org
     2016-11-29 07:56:36,352 INFO  org.dspace.app.xmlui.utils.AuthenticationUtil @ g.cherinet@cgiar.org:session_id=F628E13AB4EF2BA949198A99EFD8EBE4:ip_addr=213.55.99.121:failed_login:email=g.cherinet@cgiar.org, realm=null, result=2
     2016-11-29 07:56:36,701 INFO  org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ facets for scope, null: 23
     2016-11-29 07:56:36,747 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
     org.dspace.discovery.SearchServiceException: Error executing query
        at org.dspace.discovery.SolrServiceImpl.search(SolrServiceImpl.java:1618)
        at org.dspace.discovery.SolrServiceImpl.search(SolrServiceImpl.java:1600)
        at org.dspace.discovery.SolrServiceImpl.search(SolrServiceImpl.java:1583)
        at org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer.performSearch(SidebarFacetsTransformer.java:165)
        at org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer.addOptions(SidebarFacetsTransformer.java:174)
        at org.dspace.app.xmlui.wing.AbstractWingTransformer.startElement(AbstractWingTransformer.java:228)
        at sun.reflect.GeneratedMethodAccessor277.invoke(Unknown Source)
     ...

  • At about the same time in the solr log I see a super long query:

    2016-11-29 07:56:36,734 INFO  org.apache.solr.core.SolrCore @ [search] webapp=/solr path=/select params={q=*:*&fl=dateIssued.year,handle,search.resourcetype,search.resourceid,search.uniqueid&start=0&fq=NOT(withdrawn:true)&fq=NOT(discoverable:false)&fq=dateIssued.year:[*+TO+*]&fq=read:(g0+OR+e574+OR+g0+OR+g3+OR+g9+OR+g10+OR+g14+OR+g16+OR+g18+OR+g20+OR+g23+OR+g24+OR+g2072+OR+g2074+OR+g28+OR+g2076+OR+g29+OR+g2078+OR+g2080+OR+g34+OR+g2082+OR+g2084+OR+g38+OR+g2086+OR+g2088+OR+g2091+OR+g43+OR+g2092+OR+g2093+OR+g2095+OR+g2097+OR+g50+OR+g2099+OR+g51+OR+g2103+OR+g62+OR+g65+OR+g2115+OR+g2117+OR+g2119+OR+g2121+OR+g2123+OR+g2125+OR+g77+OR+g78+OR+g79+OR+g2127+OR+g80+OR+g2129+OR+g2131+OR+g2133+OR+g2134+OR+g2135+OR+g2136+OR+g2137+OR+g2138+OR+g2139+OR+g2140+OR+g2141+OR+g2142+OR+g2148+OR+g2149+OR+g2150+OR+g2151+OR+g2152+OR+g2153+OR+g2154+OR+g2156+OR+g2165+OR+g2167+OR+g2171+OR+g2174+OR+g2175+OR+g129+OR+g2182+OR+g2186+OR+g2189+OR+g153+OR+g158+OR+g166+OR+g167+OR+g168+OR+g169+OR+g2225+OR+g179+OR+g2227+OR+g2229+OR+g183+OR+g2231+OR+g184+OR+g2233+OR+g186+OR+g2235+OR+g2237+OR+g191+OR+g192+OR+g193+OR+g202+OR+g203+OR+g204+OR+g205+OR+g207+OR+g208+OR+g218+OR+g219+OR+g222+OR+g223+OR+g230+OR+g231+OR+g238+OR+g241+OR+g244+OR+g254+OR+g255+OR+g262+OR+g265+OR+g268+OR+g269+OR+g273+OR+g276+OR+g277+OR+g279+OR+g282+OR+g2332+OR+g2335+OR+g2338+OR+g292+OR+g293+OR+g2341+OR+g296+OR+g2344+OR+g297+OR+g2347+OR+g301+OR+g2350+OR+g303+OR+g305+OR+g2356+OR+g310+OR+g311+OR+g2359+OR+g313+OR+g2362+OR+g2365+OR+g2368+OR+g321+OR+g2371+OR+g325+OR+g2374+OR+g328+OR+g2377+OR+g2380+OR+g333+OR+g2383+OR+g2386+OR+g2389+OR+g342+OR+g343+OR+g2392+OR+g345+OR+g2395+OR+g348+OR+g2398+OR+g2401+OR+g2404+OR+g2407+OR+g364+OR+g366+OR+g2425+OR+g2427+OR+g385+OR+g387+OR+g388+OR+g389+OR+g2442+OR+g395+OR+g2443+OR+g2444+OR+g401+OR+g403+OR+g405+OR+g408+OR+g2457+OR+g2458+OR+g411+OR+g2459+OR+g414+OR+g2463+OR+g417+OR+g2465+OR+g2467+OR+g421+OR+g2469+OR+g2471+OR+g424+OR+g2473+OR+g2475+OR+g2476+OR+g429+OR+g433+OR+g2481+OR+g2482+OR+g2483+OR+g443+OR+g444+OR+g445+OR+g446+OR+g448+OR+g453+OR+g455+OR+g456+OR+g457+OR+g458+OR+g459+OR+g461+OR+g462+OR+g463+OR+g464+OR+g465+OR+g467+OR+g468+OR+g469+OR+g474+OR+g476+OR+g477+OR+g480+OR+g483+OR+g484+OR+g493+OR+g496+OR+g497+OR+g498+OR+g500+OR+g502+OR+g504+OR+g505+OR+g2559+OR+g2560+OR+g513+OR+g2561+OR+g515+OR+g516+OR+g518+OR+g519+OR+g2567+OR+g520+OR+g521+OR+g522+OR+g2570+OR+g523+OR+g2571+OR+g524+OR+g525+OR+g2573+OR+g526+OR+g2574+OR+g527+OR+g528+OR+g2576+OR+g529+OR+g531+OR+g2579+OR+g533+OR+g534+OR+g2582+OR+g535+OR+g2584+OR+g538+OR+g2586+OR+g540+OR+g2588+OR+g541+OR+g543+OR+g544+OR+g545+OR+g546+OR+g548+OR+g2596+OR+g549+OR+g551+OR+g555+OR+g556+OR+g558+OR+g561+OR+g569+OR+g570+OR+g571+OR+g2619+OR+g572+OR+g2620+OR+g573+OR+g2621+OR+g2622+OR+g575+OR+g578+OR+g581+OR+g582+OR+g584+OR+g585+OR+g586+OR+g587+OR+g588+OR+g590+OR+g591+OR+g593+OR+g595+OR+g596+OR+g598+OR+g599+OR+g601+OR+g602+OR+g603+OR+g604+OR+g605+OR+g606+OR+g608+OR+g609+OR+g610+OR+g612+OR+g614+OR+g616+OR+g620+OR+g621+OR+g623+OR+g630+OR+g635+OR+g636+OR+g646+OR+g649+OR+g683+OR+g684+OR+g687+OR+g689+OR+g691+OR+g695+OR+g697+OR+g698+OR+g699+OR+g700+OR+g701+OR+g707+OR+g708+OR+g709+OR+g710+OR+g711+OR+g712+OR+g713+OR+g714+OR+g715+OR+g716+OR+g717+OR+g719+OR+g720+OR+g729+OR+g732+OR+g733+OR+g734+OR+g736+OR+g737+OR+g738+OR+g2786+OR+g752+OR+g754+OR+g2804+OR+g757+OR+g2805+OR+g2806+OR+g760+OR+g761+OR+g2810+OR+g2815+OR+g769+OR+g771+OR+g773+OR+g776+OR+g786+OR+g787+OR+g788+OR+g789+OR+g791+OR+g792+OR+g793+OR+g794+OR+g795+OR+g796+OR+g798+OR+g800+OR+g802+OR+g803+OR+g806+OR+g808+OR+g810+OR+g814+OR+g815+OR+g817+OR+g829+O
R+g830+OR+g849+OR+g893+OR+g895+OR+g898+OR+g902+OR+g903+OR+g917+OR+g919+OR+g921+OR+g922+OR+g923+OR+g924+OR+g925+OR+g926+OR+g927+OR+g928+OR+g929+OR+g930+OR+g932+OR+g933+OR+g934+OR+g938+OR+g939+OR+g944+OR+g945+OR+g946+OR+g947+OR+g948+OR+g949+OR+g950+OR+g951+OR+g953+OR+g954+OR+g955+OR+g956+OR+g958+OR+g959+OR+g960+OR+g963+OR+g964+OR+g965+OR+g968+OR+g969+OR+g970+OR+g971+OR+g972+OR+g973+OR+g974+OR+g976+OR+g978+OR+g979+OR+g984+OR+g985+OR+g987+OR+g988+OR+g991+OR+g993+OR+g994+OR+g999+OR+g1000+OR+g1003+OR+g1005+OR+g1006+OR+g1007+OR+g1012+OR+g1013+OR+g1015+OR+g1016+OR+g1018+OR+g1023+OR+g1024+OR+g1026+OR+g1028+OR+g1030+OR+g1032+OR+g1033+OR+g1035+OR+g1036+OR+g1038+OR+g1039+OR+g1041+OR+g1042+OR+g1044+OR+g1045+OR+g1047+OR+g1048+OR+g1050+OR+g1051+OR+g1053+OR+g1054+OR+g1056+OR+g1057+OR+g1058+OR+g1059+OR+g1060+OR+g1061+OR+g1062+OR+g1063+OR+g1064+OR+g1065+OR+g1066+OR+g1068+OR+g1071+OR+g1072+OR+g1074+OR+g1075+OR+g1076+OR+g1077+OR+g1078+OR+g1080+OR+g1081+OR+g1082+OR+g1084+OR+g1085+OR+g1087+OR+g1088+OR+g1089+OR+g1090+OR+g1091+OR+g1092+OR+g1093+OR+g1094+OR+g1095+OR+g1096+OR+g1097+OR+g1106+OR+g1108+OR+g1110+OR+g1112+OR+g1114+OR+g1117+OR+g1120+OR+g1121+OR+g1126+OR+g1128+OR+g1129+OR+g1131+OR+g1136+OR+g1138+OR+g1140+OR+g1141+OR+g1143+OR+g1145+OR+g1146+OR+g1148+OR+g1152+OR+g1154+OR+g1156+OR+g1158+OR+g1159+OR+g1160+OR+g1162+OR+g1163+OR+g1165+OR+g1166+OR+g1168+OR+g1170+OR+g1172+OR+g1175+OR+g1177+OR+g1179+OR+g1181+OR+g1185+OR+g1191+OR+g1193+OR+g1197+OR+g1199+OR+g1201+OR+g1203+OR+g1204+OR+g1215+OR+g1217+OR+g1219+OR+g1221+OR+g1224+OR+g1226+OR+g1227+OR+g1228+OR+g1230+OR+g1231+OR+g1232+OR+g1233+OR+g1234+OR+g1235+OR+g1236+OR+g1237+OR+g1238+OR+g1240+OR+g1241+OR+g1242+OR+g1243+OR+g1244+OR+g1246+OR+g1248+OR+g1250+OR+g1252+OR+g1254+OR+g1256+OR+g1257+OR+g1259+OR+g1261+OR+g1263+OR+g1275+OR+g1276+OR+g1277+OR+g1278+OR+g1279+OR+g1282+OR+g1284+OR+g1288+OR+g1290+OR+g1293+OR+g1296+OR+g1297+OR+g1299+OR+g1303+OR+g1304+OR+g1306+OR+g1309+OR+g1310+OR+g1311+OR+g1312+OR+g1313+OR+g1316+OR+g1318+OR+g1320+OR+g1322+OR+g1323+OR+g1324+OR+g1325+OR+g1326+OR+g1329+OR+g1331+OR+g1347+OR+g1348+OR+g1361+OR+g1362+OR+g1363+OR+g1364+OR+g1367+OR+g1368+OR+g1369+OR+g1370+OR+g1371+OR+g1374+OR+g1376+OR+g1377+OR+g1378+OR+g1380+OR+g1381+OR+g1386+OR+g1389+OR+g1391+OR+g1392+OR+g1393+OR+g1395+OR+g1396+OR+g1397+OR+g1400+OR+g1402+OR+g1406+OR+g1408+OR+g1415+OR+g1417+OR+g1433+OR+g1435+OR+g1441+OR+g1442+OR+g1443+OR+g1444+OR+g1446+OR+g1448+OR+g1450+OR+g1451+OR+g1452+OR+g1453+OR+g1454+OR+g1456+OR+g1458+OR+g1460+OR+g1462+OR+g1464+OR+g1466+OR+g1468+OR+g1470+OR+g1471+OR+g1475+OR+g1476+OR+g1477+OR+g1478+OR+g1479+OR+g1481+OR+g1482+OR+g1483+OR+g1484+OR+g1485+OR+g1486+OR+g1487+OR+g1488+OR+g1489+OR+g1490+OR+g1491+OR+g1492+OR+g1493+OR+g1495+OR+g1497+OR+g1499+OR+g1501+OR+g1503+OR+g1504+OR+g1506+OR+g1508+OR+g1511+OR+g1512+OR+g1513+OR+g1516+OR+g1522+OR+g1535+OR+g1536+OR+g1537+OR+g1539+OR+g1540+OR+g1541+OR+g1542+OR+g1547+OR+g1549+OR+g1551+OR+g1553+OR+g1555+OR+g1557+OR+g1559+OR+g1561+OR+g1563+OR+g1565+OR+g1567+OR+g1569+OR+g1571+OR+g1573+OR+g1580+OR+g1583+OR+g1588+OR+g1590+OR+g1592+OR+g1594+OR+g1595+OR+g1596+OR+g1598+OR+g1599+OR+g1600+OR+g1601+OR+g1602+OR+g1604+OR+g1606+OR+g1610+OR+g1611+OR+g1612+OR+g1613+OR+g1616+OR+g1619+OR+g1622+OR+g1624+OR+g1625+OR+g1626+OR+g1628+OR+g1629+OR+g1631+OR+g1632+OR+g1692+OR+g1694+OR+g1695+OR+g1697+OR+g1705+OR+g1706+OR+g1707+OR+g1708+OR+g1711+OR+g1715+OR+g1717+OR+g1719+OR+g1721+OR+g1722+OR+g1723+OR+g1724+OR+g1725+OR+g1726+OR+g1727+OR+g1731+OR+g1732+OR+g1736+OR+g1737+OR+g1738+OR+g1740+OR+g1742+OR+g1743+OR+g1753+OR+g1755+OR+g1758+OR+g1759+OR+g1764+OR+g1766+OR+g176
9+OR+g1774+OR+g1782+OR+g1794+OR+g1796+OR+g1797+OR+g1814+OR+g1818+OR+g1826+OR+g1853+OR+g1855+OR+g1857+OR+g1858+OR+g1859+OR+g1860+OR+g1861+OR+g1863+OR+g1864+OR+g1865+OR+g1867+OR+g1869+OR+g1871+OR+g1873+OR+g1875+OR+g1877+OR+g1879+OR+g1881+OR+g1883+OR+g1884+OR+g1885+OR+g1887+OR+g1889+OR+g1891+OR+g1892+OR+g1894+OR+g1896+OR+g1898+OR+g1900+OR+g1902+OR+g1907+OR+g1910+OR+g1915+OR+g1916+OR+g1917+OR+g1918+OR+g1929+OR+g1931+OR+g1932+OR+g1933+OR+g1934+OR+g1936+OR+g1937+OR+g1938+OR+g1939+OR+g1940+OR+g1942+OR+g1944+OR+g1945+OR+g1948+OR+g1950+OR+g1955+OR+g1961+OR+g1962+OR+g1964+OR+g1966+OR+g1968+OR+g1970+OR+g1972+OR+g1974+OR+g1976+OR+g1979+OR+g1982+OR+g1984+OR+g1985+OR+g1986+OR+g1987+OR+g1989+OR+g1991+OR+g1996+OR+g2003+OR+g2007+OR+g2011+OR+g2019+OR+g2020+OR+g2046)&sort=dateIssued.year_sort+desc&rows=1&wt=javabin&version=2} hits=56080 status=0 QTime=3

  • Which, according to some old threads on DSpace Tech, means that the user has a lot of permissions (from groups or on the individual eperson) which increases the Solr query size / query URL

  • It might be fixed by increasing the Tomcat maxHttpHeaderSize, which is 8192 (8 KB) by default

  • I’ve increased the maxHttpHeaderSize to 16384 on DSpace Test and the user said he is now able to see the communities on the homepage

  • I will make the changes on CGSpace soon

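  • For reference, maxHttpHeaderSize is just an attribute on the HTTP connector in Tomcat’s server.xml; a minimal sketch (the other attributes here are generic placeholders, not our actual connector configuration):

    <Connector port="8080" protocol="HTTP/1.1"
               connectionTimeout="20000"
               maxHttpHeaderSize="16384"
               URIEncoding="UTF-8" />
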
  • A few users are reporting having issues with their workflows; they get the following message: “You are not allowed to perform this task”

  • Might be the same as DS-2920 on the bug tracker

    2016-11-30

    2016-12-02

    2016-12-02 03:00:32,352 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail="dc.title", transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail="THUMBNAIL", transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")

  • I see thousands of them in the logs for the last few months, so it’s not related to the DSpace 5.5 upgrade

  • I’ve raised a ticket with Atmire to ask

  • Another worrying error from dspace.log is:

    org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.NoSuchMethodError: com.atmire.statistics.generator.DSpaceObjectDatasetGenerator.toDatasetQuery(Lorg/dspace/core/Context;)Lcom/atmire/statistics/content/DatasetQuery;
             at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:972)
             at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:852)
             at org.springframework.web.servlet.DispatcherServlet.render(DispatcherServlet.java:1180)
             at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:950)
             ... 35 more

    2016-12-07

    {
      "responseHeader": {
        "status": 0,
        "QTime": 1,
        "params": {
          "q": "id:0b4fcbc1-d930-4319-9b4d-ea1553cca70b",
          "indent": "true",
          "wt": "json",
          "_": "1481102189244"
        }
      },
      "response": {
        "numFound": 1,
        "start": 0,
        "docs": [
          {
            "id": "0b4fcbc1-d930-4319-9b4d-ea1553cca70b",
            "field": "dc_contributor_author",
            "value": "Grace, D.",
            "deleted": false,
            "creation_date": "2016-11-10T15:13:40.318Z",
            "last_modified_date": "2016-11-10T15:13:40.318Z",
            "authority_type": "person",
            "first_name": "D.",
            "last_name": "Grace"
          }
        ]
      }
    }

  • I think I can just update the value, first_name, and last_name fields…

  • The update syntax should be something like this, but I’m getting errors from Solr:

    $ curl 'localhost:8081/solr/authority/update?commit=true&wt=json&indent=true' -H 'Content-type:application/json' -d '[{"id":"1","price":{"set":100}}]'
     {
      "responseHeader":{
        "status":400,
        "QTime":0},
      "error":{
        "msg":"Unexpected character '[' (code 91) in prolog; expected '<'\n at [row,col {unknown-source}]: [1,1]",
        "code":400}}

  • When I try using the XML format I get an error that the updateLog needs to be configured for that core
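
  • For reference, Solr atomic updates need the transaction log enabled; a sketch of what that would look like in the authority core’s solrconfig.xml (not a change I made at the time):

    <!-- inside the <updateHandler class="solr.DirectUpdateHandler2"> section -->
    <updateLog>
      <str name="dir">${solr.ulog.dir:}</str>
    </updateLog>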


  • Maybe I can just remove the authority UUID from the records, run the indexing again so it creates a new one for each name variant, then match them correctly?

    dspace=# update metadatavalue set authority=null, confidence=-1 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
     UPDATE 561

  • Then I’ll reindex discovery and authority and see how the authority Solr core looks

  • After this, now there are authorities for some of the “Grace, D.” and “Grace, Delia” text_values in the database (the first version is actually the same authority that already exists in the core, so it was just added back to some text_values, but the second one is new):

    $ curl 'localhost:8081/solr/authority/select?q=id%3A18ea1525-2513-430a-8817-a834cd733fbc&wt=json&indent=true'
     {
      "responseHeader":{
        "status":0,
        "QTime":0,
        "params":{
          "q":"id:18ea1525-2513-430a-8817-a834cd733fbc",
          "indent":"true",
          "wt":"json"}},
      "response":{"numFound":1,"start":0,"docs":[
          {
            "id":"18ea1525-2513-430a-8817-a834cd733fbc",
            "field":"dc_contributor_author",
            "value":"Grace, Delia",
            "deleted":false,
            "creation_date":"2016-12-07T10:54:34.356Z",
            "last_modified_date":"2016-12-07T10:54:34.356Z",
            "authority_type":"person",
            "first_name":"Delia",
            "last_name":"Grace"}]
    }}

  • So now I could set them all to this ID and the name would be ok, but there has to be a better way!

  • In this case it seems that since there were also two different IDs in the original database, I just picked the wrong one!

  • Better to use:

    dspace=# update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-4175-991c-2e7315000f0c', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';

  • This proves that unifying author name varieties in authorities is easy, but fixing the name in the authority is tricky!

  • Perhaps another way is to just add our own UUID to the authority field for the text_value we like, then re-index authority so they get synced from PostgreSQL to Solr, then set the other text_values to use that authority ID
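
  • A sketch of that approach (the UUID would come from uuidgen; the value in the SQL is a placeholder, not a real authority ID):

    $ uuidgen | tr '[A-Z]' '[a-z]'
    dspace=# update metadatavalue set authority='<new-uuid>', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value='Grace, Delia';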

  • Deploy MQM WARN fix on CGSpace (#289)

  • Deploy “take task” hack/fix on CGSpace (#290)

  • I ran the following author corrections and then reindexed discovery:

    update metadatavalue set authority='b041f2f4-19e7-4113-b774-0439baabd197', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Mora Benard%';
     update metadatavalue set text_value = 'Hoek, Rein van der', authority='4d6cbce2-6fd5-4b43-9363-58d18e7952c9', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Hoek, R%';
     update metadatavalue set text_value = 'Hoek, Rein van der', authority='4d6cbce2-6fd5-4b43-9363-58d18e7952c9', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like '%an der Hoek%' and text_value !~ '^.*W\.?$';
     update metadatavalue set authority='18349f29-61b1-44d7-ac60-89e55546e812', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thorne, P%';
     update metadatavalue set authority='0d8369bb-57f7-4b2f-92aa-af820b183aca', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thornton, P%';
     update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-4175-991c-2e7315000f0c', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';

    2016-12-08

    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thorne%';
        text_value    |              authority               | confidence
    ------------------+--------------------------------------+------------
     Thorne, P.J.     | 18349f29-61b1-44d7-ac60-89e55546e812 |        600
     Thorne           | 18349f29-61b1-44d7-ac60-89e55546e812 |        600
     Thorne-Lyman, A. | 0781e13a-1dc8-4e3f-82e8-5c422b44a344 |         -1
     Thorne, M. D.    | 54c52649-cefd-438d-893f-3bcef3702f07 |         -1
     Thorne, P.J      | 18349f29-61b1-44d7-ac60-89e55546e812 |        600
     Thorne, P.       | 18349f29-61b1-44d7-ac60-89e55546e812 |        600
     (6 rows)

  • I generated a new UUID using uuidgen | tr [A-Z] [a-z] and set it along with the correct name variation for all records:

    dspace=# update metadatavalue set authority='b2f7603d-2fb5-4018-923a-c4ec8d85b3bb', text_value='Thorne, P.J.' where resource_type_id=2 and metadata_field_id=3 and authority='18349f29-61b1-44d7-ac60-89e55546e812';
     UPDATE 43

  • Apparently we also need to normalize Phil Thornton’s names to Thornton, Philip K.:

    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^Thornton[,\.]? P.*';
         text_value      |              authority               | confidence
    ---------------------+--------------------------------------+------------
     Thornton, P         | 0d8369bb-57f7-4b2f-92aa-af820b183aca |        600
     Thornton, P K.      | 0d8369bb-57f7-4b2f-92aa-af820b183aca |        600
     Thornton, P K       | 0d8369bb-57f7-4b2f-92aa-af820b183aca |        600
     Thornton. P.K.      | 3e1e6639-d4fb-449e-9fce-ce06b5b0f702 |         -1
     Thornton, P K .     | 0d8369bb-57f7-4b2f-92aa-af820b183aca |        600
     Thornton, P.K.      | 0d8369bb-57f7-4b2f-92aa-af820b183aca |        600
     Thornton, P.K       | 0d8369bb-57f7-4b2f-92aa-af820b183aca |        600
     Thornton, Philip K  | 0d8369bb-57f7-4b2f-92aa-af820b183aca |        600
     Thornton, Philip K. | 0d8369bb-57f7-4b2f-92aa-af820b183aca |        600
     Thornton, P. K.     | 0d8369bb-57f7-4b2f-92aa-af820b183aca |        600
     (10 rows)

  • Seems his original authorities are using an incorrect version of the name, so I need to generate another UUID and tie it to the correct name, then reindex:

    dspace=# update metadatavalue set authority='2df8136e-d8f4-4142-b58c-562337cab764', text_value='Thornton, Philip K.', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^Thornton[,\.]? P.*';
     UPDATE 362

  • It seems that, when you are messing with authority and author text values in the database, it is better to run the authority reindex first (postgres→solr authority core) and then the Discovery reindex (postgres→solr Discovery core); a sketch of the commands is below

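  • For reference (the paths match those used for DSpace Test elsewhere in these notes; the commands are the standard DSpace CLI, not copied from the original entry):

    $ /home/dspacetest.cgiar.org/bin/dspace index-authority
    $ /home/dspacetest.cgiar.org/bin/dspace index-discovery -b
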
  • Everything looks ok after authority and discovery reindex

  • In other news, I think we should really be using more RAM for PostgreSQL’s shared_buffers

  • The PostgreSQL documentation recommends using 25% of the system’s RAM on dedicated systems, but we should use a bit less since we also have a massive JVM heap and also benefit from some RAM being used by the OS cache (see the sketch below)

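  • A minimal sketch of the kind of change that would mean, assuming a hypothetical server with 8 GB of RAM (illustrative value, not our actual setting):

    # postgresql.conf: a bit under the usual 25% guideline to leave room for the JVM heap and OS cache
    shared_buffers = 1536MB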

    2016-12-09

    dspace=# update metadatavalue set authority='34df639a-42d8-4867-a3f2-1892075fcb3f', text_value='Thorne, P.J.' where resource_type_id=2 and metadata_field_id=3 and authority='18349f29-61b1-44d7-ac60-89e55546e812' or authority='021cd183-946b-42bb-964e-522ebff02993';
     dspace=# update metadatavalue set authority='2df8136e-d8f4-4142-b58c-562337cab764', text_value='Thornton, Philip K.', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^Thornton[,\.]? P.*';
    -
    - -
  • The authority IDs were different now than when I was looking a few days ago so I had to adjust them here

  • + - -

    2016-12-11



    postgres_bgwriter-week
    postgres_connections_ALL-week

    International Center for Tropical Agriculture::3026b1de-9302-4f3e-85ab-ef48da024eb2::600
     International Center for Tropical Agriculture::3026b1de-9302-4f3e-85ab-ef48da024eb2::500
     International Center for Tropical Agriculture::3026b1de-9302-4f3e-85ab-ef48da024eb2::0
    -
    - -
  • Some in the same dc.contributor.author field, and some in others like dc.contributor.author[en_US] etc

  • - -
  • Removing the duplicates in OpenRefine and uploading a CSV to DSpace says “no changes detected”

  • - -
  • Seems like the only way to sort of clean these up would be to start in SQL:

    - +
    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like 'International Center for Tropical Agriculture';
                      text_value                   |              authority               | confidence
    -----------------------------------------------+--------------------------------------+------------
     International Center for Tropical Agriculture | cc726b78-a2f4-4ee9-af98-855c2ea31c36 |         -1
     International Center for Tropical Agriculture |                                      |        600
     International Center for Tropical Agriculture | 3026b1de-9302-4f3e-85ab-ef48da024eb2 |        500
     International Center for Tropical Agriculture | cc726b78-a2f4-4ee9-af98-855c2ea31c36 |        600
     International Center for Tropical Agriculture |                                      |         -1
     International Center for Tropical Agriculture | cc726b78-a2f4-4ee9-af98-855c2ea31c36 |        500
     International Center for Tropical Agriculture | 3026b1de-9302-4f3e-85ab-ef48da024eb2 |        600
     International Center for Tropical Agriculture | 3026b1de-9302-4f3e-85ab-ef48da024eb2 |         -1
     International Center for Tropical Agriculture | 3026b1de-9302-4f3e-85ab-ef48da024eb2 |          0
     dspace=# update metadatavalue set authority='3026b1de-9302-4f3e-85ab-ef48da024eb2', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value = 'International Center for Tropical Agriculture';
     UPDATE 1693
     dspace=# update metadatavalue set authority='3026b1de-9302-4f3e-85ab-ef48da024eb2', text_value='International Center for Tropical Agriculture', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like '%CIAT%';
     UPDATE 35
    -
  • - -
  • Work on article for KM4Dev journal

  • + - -

    2016-12-13


    postgres_bgwriter-week
    postgres_connections_ALL-week

    # find /home/dspacetest.cgiar.org/log -regextype posix-extended -iregex ".*\.log.*" ! -iregex ".*dspace\.log.*" ! -iregex ".*\.(gz|lrz|lzo|xz)" ! -newermt "Yesterday" -exec schedtool -B -e ionice -c2 -n7 xz {} \;
    -
    - -
  • Since there are xzgrep and xzless we can actually just compress them after one day, why not?!

  • - -
  • We can keep the zipped ones for two weeks just in case we need to look for errors, etc, and delete them after that
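
  • A hypothetical companion cron job (same path and tools as the find command above) could then prune the compressed ones once they are older than two weeks:

    # find /home/dspacetest.cgiar.org/log -regextype posix-extended -iregex ".*\.(gz|lrz|lzo|xz)" -mtime +14 -delete # hypothetical: prune after two weeks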

  • - -
  • I use schedtool -B and ionice -c2 -n7 to set the CPU scheduling to SCHED_BATCH and the IO to best effort which should, in theory, impact important system processes like Tomcat and PostgreSQL less

  • - -
  • When the tasks are running you can see that the policies do apply:

    - +
    $ schedtool $(ps aux | grep "xz /home" | grep -v grep | awk '{print $2}') && ionice -p $(ps aux | grep "xz /home" | grep -v grep | awk '{print $2}')
     PID 17049: PRIO   0, POLICY B: SCHED_BATCH   , NICE   0, AFFINITY 0xf
     best-effort: prio 7
    -
  • - -
  • All in all this should free up a few gigs (we were at 9.3GB free when I started)

  • - -
  • Next thing to look at is whether we need Tomcat’s access logs

  • - -
  • I just looked and it seems that we saved 10GB by zipping these logs

  • - -
  • Some users pointed out issues with the “most popular” stats on a community or collection

  • - -
  • This error appears in the logs when you try to view them:

    - +
    2016-12-13 21:17:37,486 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
     org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.NoSuchMethodError: com.atmire.statistics.generator.DSpaceObjectDatasetGenerator.toDatasetQuery(Lorg/dspace/core/Context;)Lcom/atmire/statistics/content/DatasetQuery;
     	at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:972)
    @@ -741,69 +636,54 @@ Caused by: java.lang.NoSuchMethodError: com.atmire.statistics.generator.DSpaceOb
     	at com.atmire.statistics.mostpopular.JSONStatsMostPopularGenerator.generate(SourceFile:246)
     	at com.atmire.app.xmlui.aspect.statistics.JSONStatsMostPopular.generate(JSONStatsMostPopular.java:145)
     	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    -
  • - -
  • It happens on development and production, so I will have to ask Atmire

  • - -
  • Most likely an issue with installation/configuration

  • + - -

    2016-12-14

    - +

    2016-12-14

    - -

    2016-12-15

    - +

    2016-12-15

    - -

    Select all items with "rangelands" in metadata
    Add RANGELANDS ILRI subject

    2016-12-18

    dspace=# update metadatavalue set authority='5ff35043-942e-4d0a-b377-4daed6e3c1a3', confidence=600, text_value='Duncan, Alan' where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^.*Duncan,? A.*';
     UPDATE 204
     dspace=# update metadatavalue set authority='46804b53-ea30-4a85-9ccf-b79a35816fa9', confidence=600, text_value='Mekonnen, Kindu' where resource_type_id=2 and metadata_field_id=3 and text_value like '%Mekonnen, K%';
     UPDATE 89
     dspace=# update metadatavalue set authority='f840da02-26e7-4a74-b7ba-3e2b723f3684', confidence=600, text_value='Lukuyu, Ben A.' where resource_type_id=2 and metadata_field_id=3 and text_value like '%Lukuyu, B%';
     UPDATE 140
    -
    - -
  • Generated a new UUID for Ben using uuidgen | tr [A-Z] [a-z] as the one in Solr had his ORCID but the name format was incorrect

  • - -
  • In theory DSpace should be able to check names from ORCID and update the records in the database, but I find that this doesn’t work (see Jira bug DS-3302)

  • - -
  • I need to run these updates along with the other one for CIAT that I found last week

  • - -
  • Enable OCSP stapling for hosts >= Ubuntu 16.04 in our Ansible playbooks (#76)

  • - -
  • Working for DSpace Test on the second response:

    - +
    $ openssl s_client -connect dspacetest.cgiar.org:443 -servername dspacetest.cgiar.org -tls1_2 -tlsextdebug -status
     ...
     OCSP response: no response sent
    @@ -811,19 +691,16 @@ $ openssl s_client -connect dspacetest.cgiar.org:443 -servername dspacetest.cgia
     ...
     OCSP Response Data:
     ...
    -Cert Status: good
    -
  • - -
  • Migrate CGSpace to new server, roughly following these steps:

  • - -
  • On old server:

    - + Cert Status: good +
    # service tomcat7 stop
     # /home/backup/scripts/postgres_backup.sh
    -
  • - -
  • On new server:

    - +
    # systemctl stop tomcat7
     # rsync -4 -av --delete 178.79.187.182:/home/cgspace.cgiar.org/assetstore/ /home/cgspace.cgiar.org/assetstore/
     # rsync -4 -av --delete 178.79.187.182:/home/backup/ /home/backup/
    @@ -848,44 +725,34 @@ $ cd src/git/DSpace/dspace/target/dspace-installer
     $ ant update clean_backups
     $ exit
     # systemctl start tomcat7
    -
  • - -
  • It took about twenty minutes and afterwards I had to check a few things, like:

    - + - -

    2016-12-22

    - +
  • + +

    2016-12-22

    $ [dspace]/bin/dspace metadata-export -a -f /tmp/iita.csv -i 10568/68616
    -
    - - -

    2016-12-28

    - +

    2016-12-28

    - -

    munin postgres stats


    diff --git a/docs/2017-01/index.html b/docs/2017-01/index.html index 614f407ed..a4851cf20 100644 --- a/docs/2017-01/index.html +++ b/docs/2017-01/index.html @@ -8,10 +8,9 @@ @@ -22,12 +21,11 @@ I asked on the dspace-tech mailing list because it seems to be broken, and actua - + @@ -108,77 +106,71 @@ I asked on the dspace-tech mailing list because it seems to be broken, and actua

    -

    2017-01-02

    - +

    2017-01-02

    - -

    2017-01-04

    - +

    2017-01-04

    $ JAVA_OPTS="-Xms768m -Xmx768m -Dfile.encoding=UTF-8" ~/dspace/bin/dspace stats-util -s
     Moving: 9318 into core statistics-2016
     Exception: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2016
     org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2016
    -    at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:566)
    -    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
    -    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
    -    at org.dspace.statistics.SolrLogger.shardSolrIndex(SourceFile:2291)
    -    at org.dspace.statistics.util.StatisticsClient.main(StatisticsClient.java:106)
    -    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    -    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    -    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    -    at java.lang.reflect.Method.invoke(Method.java:498)
    -    at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
    -    at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
    +        at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:566)
    +        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
    +        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
    +        at org.dspace.statistics.SolrLogger.shardSolrIndex(SourceFile:2291)
    +        at org.dspace.statistics.util.StatisticsClient.main(StatisticsClient.java:106)
    +        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    +        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    +        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    +        at java.lang.reflect.Method.invoke(Method.java:498)
    +        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
    +        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
     Caused by: org.apache.http.client.ClientProtocolException
    -    at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:867)
    -    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
    -    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
    -    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
    -    at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:448)
    -    ... 10 more
    +        at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:867)
    +        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
    +        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
    +        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
    +        at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:448)
    +        ... 10 more
     Caused by: org.apache.http.client.NonRepeatableRequestException: Cannot retry request with a non-repeatable request entity.  The cause lists the reason the original request failed.
    -    at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:659)
    -    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487)
    -    at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
    -    ... 14 more
    +        at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:659)
    +        at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487)
    +        at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
    +        ... 14 more
     Caused by: java.net.SocketException: Broken pipe (Write failed)
    -    at java.net.SocketOutputStream.socketWrite0(Native Method)
    -    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
    -    at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
    -    at org.apache.http.impl.io.AbstractSessionOutputBuffer.write(AbstractSessionOutputBuffer.java:181)
    -    at org.apache.http.impl.io.ChunkedOutputStream.flushCacheWithAppend(ChunkedOutputStream.java:124)
    -    at org.apache.http.impl.io.ChunkedOutputStream.write(ChunkedOutputStream.java:181)
    -    at org.apache.http.entity.InputStreamEntity.writeTo(InputStreamEntity.java:132)
    -    at org.apache.http.entity.HttpEntityWrapper.writeTo(HttpEntityWrapper.java:89)
    -    at org.apache.http.impl.client.EntityEnclosingRequestWrapper$EntityWrapper.writeTo(EntityEnclosingRequestWrapper.java:108)
    -    at org.apache.http.impl.entity.EntitySerializer.serialize(EntitySerializer.java:117)
    -    at org.apache.http.impl.AbstractHttpClientConnection.sendRequestEntity(AbstractHttpClientConnection.java:265)
    -    at org.apache.http.impl.conn.ManagedClientConnectionImpl.sendRequestEntity(ManagedClientConnectionImpl.java:203)
    -    at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:236)
    -    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:121)
    -    at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:685)
    -    ... 16 more
    -
    - -
  • And the DSpace log shows:

    - + at java.net.SocketOutputStream.socketWrite0(Native Method) + at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109) + at java.net.SocketOutputStream.write(SocketOutputStream.java:153) + at org.apache.http.impl.io.AbstractSessionOutputBuffer.write(AbstractSessionOutputBuffer.java:181) + at org.apache.http.impl.io.ChunkedOutputStream.flushCacheWithAppend(ChunkedOutputStream.java:124) + at org.apache.http.impl.io.ChunkedOutputStream.write(ChunkedOutputStream.java:181) + at org.apache.http.entity.InputStreamEntity.writeTo(InputStreamEntity.java:132) + at org.apache.http.entity.HttpEntityWrapper.writeTo(HttpEntityWrapper.java:89) + at org.apache.http.impl.client.EntityEnclosingRequestWrapper$EntityWrapper.writeTo(EntityEnclosingRequestWrapper.java:108) + at org.apache.http.impl.entity.EntitySerializer.serialize(EntitySerializer.java:117) + at org.apache.http.impl.AbstractHttpClientConnection.sendRequestEntity(AbstractHttpClientConnection.java:265) + at org.apache.http.impl.conn.ManagedClientConnectionImpl.sendRequestEntity(ManagedClientConnectionImpl.java:203) + at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:236) + at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:121) + at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:685) + ... 16 more +
    2017-01-04 22:39:05,412 INFO  org.dspace.statistics.SolrLogger @ Created core with name: statistics-2016
     2017-01-04 22:39:05,412 INFO  org.dspace.statistics.SolrLogger @ Moving: 9318 records into core statistics-2016
     2017-01-04 22:39:07,310 INFO  org.apache.http.impl.client.SystemDefaultHttpClient @ I/O exception (java.net.SocketException) caught when processing request to {}->http://localhost:8081: Broken pipe (Write failed)
     2017-01-04 22:39:07,310 INFO  org.apache.http.impl.client.SystemDefaultHttpClient @ Retrying request to {}->http://localhost:8081
    -
  • - -
  • Despite failing instantly, a statistics-2016 directory was created, but it only has a data dir (no conf)

  • - -
  • The Tomcat access logs show more:

    - +
    127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] "GET /solr/statistics/select?q=type%3A2+AND+id%3A1&wt=javabin&version=2 HTTP/1.1" 200 107
     127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] "GET /solr/statistics/select?q=*%3A*&rows=0&facet=true&facet.range=time&facet.range.start=NOW%2FYEAR-17YEARS&facet.range.end=NOW%2FYEAR%2B0YEARS&facet.range.gap=%2B1YEAR&facet.mincount=1&wt=javabin&version=2 HTTP/1.1" 200 423
     127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] "GET /solr/admin/cores?action=STATUS&core=statistics-2016&indexInfo=true&wt=javabin&version=2 HTTP/1.1" 200 77
    @@ -188,228 +180,163 @@ Caused by: java.net.SocketException: Broken pipe (Write failed)
     127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] "POST /solr//statistics-2016/update/csv?commit=true&softCommit=false&waitSearcher=true&f.previousWorkflowStep.split=true&f.previousWorkflowStep.separator=%7C&f.previousWorkflowStep.encapsulator=%22&f.actingGroupId.split=true&f.actingGroupId.separator=%7C&f.actingGroupId.encapsulator=%22&f.containerCommunity.split=true&f.containerCommunity.separator=%7C&f.containerCommunity.encapsulator=%22&f.range.split=true&f.range.separator=%7C&f.range.encapsulator=%22&f.containerItem.split=true&f.containerItem.separator=%7C&f.containerItem.encapsulator=%22&f.p_communities_map.split=true&f.p_communities_map.separator=%7C&f.p_communities_map.encapsulator=%22&f.ngram_query_search.split=true&f.ngram_query_search.separator=%7C&f.ngram_query_search.encapsulator=%22&f.containerBitstream.split=true&f.containerBitstream.separator=%7C&f.containerBitstream.encapsulator=%22&f.owningItem.split=true&f.owningItem.separator=%7C&f.owningItem.encapsulator=%22&f.actingGroupParentId.split=true&f.actingGroupParentId.separator=%7C&f.actingGroupParentId.encapsulator=%22&f.text.split=true&f.text.separator=%7C&f.text.encapsulator=%22&f.simple_query_search.split=true&f.simple_query_search.separator=%7C&f.simple_query_search.encapsulator=%22&f.owningComm.split=true&f.owningComm.separator=%7C&f.owningComm.encapsulator=%22&f.owner.split=true&f.owner.separator=%7C&f.owner.encapsulator=%22&f.filterquery.split=true&f.filterquery.separator=%7C&f.filterquery.encapsulator=%22&f.p_group_map.split=true&f.p_group_map.separator=%7C&f.p_group_map.encapsulator=%22&f.actorMemberGroupId.split=true&f.actorMemberGroupId.separator=%7C&f.actorMemberGroupId.encapsulator=%22&f.bitstreamId.split=true&f.bitstreamId.separator=%7C&f.bitstreamId.encapsulator=%22&f.group_name.split=true&f.group_name.separator=%7C&f.group_name.encapsulator=%22&f.p_communities_name.split=true&f.p_communities_name.separator=%7C&f.p_communities_name.encapsulator=%22&f.query.split=true&f.query.separator=%7C&f.query.encapsulator=%22&f.workflowStep.split=true&f.workflowStep.separator=%7C&f.workflowStep.encapsulator=%22&f.containerCollection.split=true&f.containerCollection.separator=%7C&f.containerCollection.encapsulator=%22&f.complete_query_search.split=true&f.complete_query_search.separator=%7C&f.complete_query_search.encapsulator=%22&f.p_communities_id.split=true&f.p_communities_id.separator=%7C&f.p_communities_id.encapsulator=%22&f.rangeDescription.split=true&f.rangeDescription.separator=%7C&f.rangeDescription.encapsulator=%22&f.group_id.split=true&f.group_id.separator=%7C&f.group_id.encapsulator=%22&f.bundleName.split=true&f.bundleName.separator=%7C&f.bundleName.encapsulator=%22&f.ngram_simplequery_search.split=true&f.ngram_simplequery_search.separator=%7C&f.ngram_simplequery_search.encapsulator=%22&f.group_map.split=true&f.group_map.separator=%7C&f.group_map.encapsulator=%22&f.owningColl.split=true&f.owningColl.separator=%7C&f.owningColl.encapsulator=%22&f.p_group_id.split=true&f.p_group_id.separator=%7C&f.p_group_id.encapsulator=%22&f.p_group_name.split=true&f.p_group_name.separator=%7C&f.p_group_name.encapsulator=%22&wt=javabin&version=2 HTTP/1.1" 409 156
     127.0.0.1 - - [04/Jan/2017:22:44:00 +0200] "POST /solr/datatables/update?wt=javabin&version=2 HTTP/1.1" 200 41
     127.0.0.1 - - [04/Jan/2017:22:44:00 +0200] "POST /solr/datatables/update HTTP/1.1" 200 40
    -
  • - -
  • Very interesting… it creates the core and then fails somehow

  • + - -

    2017-01-08

    - +

    2017-01-08

    - -

    2017-01-09

    - +

    2017-01-09

    - -

    Crazy item mapping

    - -

    2017-01-10

    - +

    Crazy item mapping

    +

    2017-01-10

    - -

    2017-01-11

    - +
    dspace=#  select * from collection2item where item_id = '80596';
    +
    +
    dspace=# delete from collection2item where item_id = '80596' and id not in (90792, 90806, 90807);
    +
    +

    2017-01-11

    - -

    2017-01-13

    - +
    Traceback (most recent call last):
    +  File "./fix-metadata-values.py", line 80, in <module>
    +    print("Fixing {} occurences of: {}".format(records_to_fix, record[0]))
    +UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 15: ordinal not in range(128)
    +
    +
    print("Fixing {} occurences of: {}".format(records_to_fix, record[0].encode('utf-8')))
    +
    +
    $ ./fix-metadata-values.py -i /tmp/fix-27-journal-titles.csv -f dc.source -t correct -m 55 -d dspace -u dspace -p 'fuuu'
    +
    +
    dspace-# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=55 group by text_value order by count desc limit 500) to /tmp/journal-titles.csv with csv;
    +
    +

    2017-01-13

    - -

    2017-01-16

    - +

    2017-01-16

    /* 184 in correct mappings: https://cgspace.cgiar.org/handle/10568/78596 */
     delete from collection2item where item_id = '80596' and id not in (90792, 90806, 90807);
     /* 1 incorrect mapping: https://cgspace.cgiar.org/handle/10568/78658 */
     delete from collection2item where id = '91082';
    -
    - - -

    2017-01-17

    - +

    2017-01-17

    value.replace("'",'%27')
    -
    - -
  • Add the item’s Type to the filename column as a hint to SAF Builder so it can set a more useful description field:

    - +
    value + "__description:" + cells["dc.type"].value
    -
  • - -
  • Test importing of the new CIAT records (actually there are 232, not 234):

    - +
    $ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" /home/dspacetest.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/79042 --source /home/aorth/CIAT_234/SimpleArchiveFormat/ --mapfile=/tmp/ciat.map &> /tmp/ciat.log
    -
  • - -
  • Many of the PDFs are 20, 30, 40, 50+ MB, which makes a total of 4GB

  • - -
  • These are scanned from paper and likely have no compression, so we should test whether these compression techniques help without compromising the quality too much:

    - +
    $ convert -compress Zip -density 150x150 input.pdf output.pdf
     $ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf
    -
  • - -
  • A suggestion I saw somewhere on the Internet was to use a DPI of 144
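
  • To sanity-check the tradeoff (just a sketch; pdfimages is part of poppler-utils), one could compare the file sizes and the resulting image resolutions:

    $ du -h input.pdf output.pdf
    $ pdfimages -list output.pdf | head -n 5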

  • + - -

    2017-01-19

    - +

    2017-01-19

    $ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/68704 --source /home/aorth/CIAT_232/SimpleArchiveFormat/ --mapfile=/tmp/ciat.map &> /tmp/ciat.log
    -
    - - -

    2017-01-22

    - +

    2017-01-22

    - -

    2017-01-23

    - +

    2017-01-23

    $ for community in 10568/171 10568/27868 10568/231 10568/27869 10568/150 10568/230 10568/32724 10568/172; do /home/cgspace.cgiar.org/bin/dspace community-filiator --remove --parent=10568/27866 --child="$community" && /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/79164 --child="$community"; done
    -
    - -
  • Move some collections with move-collections.sh using the following config:

    - +
    10568/42161 10568/171 10568/79341
     10568/41914 10568/171 10568/79340
    -
  • - - -

    2017-01-24

    - +

    2017-01-24

    - -

    2017-01-25

    - +
    $ ./fix-metadata-values.py -i /tmp/fix-49-journal-titles.csv -f dc.source -t correct -m 55 -d dspace -u dspace -p 'password'
    +
    +
    dspace-# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=55 group by text_value order by count desc limit 500) to /tmp/journal-titles.csv with csv;
    +
    +

    2017-01-25

    +
  • But now we have a new issue with the “Types” in Content statistics not being respected—we only get the defaults, despite having custom settings in dspace/config/modules/atmire-cua.cfg
  • - -

    2017-01-27

    - +

    2017-01-27

    - -

    2017-01-28

    - +

    2017-01-28

    - -

    2017-01-29

    - +

    2017-01-29

    +
    $ ./fix-metadata-values.py -i /tmp/CCAFS-Authors-Feb-7.csv -f dc.contributor.author -t 'correct name' -m 3 -d dspace -u dspace -p fuuu
    +

    2017-02-09

    - -

    2017-02-10

    - +
    $ ./fix-metadata-values.py -i ccafs-flagships-feb7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p fuuu
    +

    2017-02-10

    - -

    2017-02-14

    - +

    2017-02-14

    +
  • I still need to test these, especially as the last two which change some stuff with Solr maintenance
  • - -

    2017-02-15

    - +

    2017-02-15

    - -

    2017-02-16

    - +

    2017-02-16

    - -

    CGSpace meminfo

    - +

    CGSpace meminfo

    - -

    CGSpace CPU

    - +

    CGSpace CPU

    handle.canonical.prefix = https://hdl.handle.net/
    -
    - -
  • And then a SQL command to update existing records:

    - +
    dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://hdl.handle.net', 'https://hdl.handle.net') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'uri');
     UPDATE 58193
    -
  • - -
  • Seems to work fine!

  • - -
  • I noticed a few items that have incorrect DOI links (dc.identifier.doi), and after looking in the database I see there are over 100 that are missing the scheme or are just plain wrong:

    - +
    dspace=# select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value not like 'http%://%';
    -
  • - -
  • This will replace any that begin with 10. and change them to https://dx.doi.org/10.:

    - +
    dspace=# update metadatavalue set text_value = regexp_replace(text_value, '(^10\..+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like '10.%';
    -
  • - -
  • This will get any that begin with doi:10. and change them to https://dx.doi.org/10.x:

    - +
    dspace=# update metadatavalue set text_value = regexp_replace(text_value, '^doi:(10\..+$)', 'https://dx.doi.org/\1') where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'doi:10%';
    -
  • - -
  • Fix DOIs like dx.doi.org/10. to be https://dx.doi.org/10.:

    - +
    dspace=# update metadatavalue set text_value = regexp_replace(text_value, '(^dx.doi.org/.+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'dx.doi.org/%';
    -
  • - -
  • Fix DOIs like http//:

    - +
    dspace=# update metadatavalue set text_value = regexp_replace(text_value, '^http//(dx.doi.org/.+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'http//%';
    -
  • - -
  • Fix DOIs like dx.doi.org./:

    - +
    dspace=# update metadatavalue set text_value = regexp_replace(text_value, '(^dx.doi.org\./.+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'dx.doi.org./%'
     
    -
  • - -
  • Delete some invalid DOIs:

    - +
    dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value in ('DOI','CPWF Mekong','Bulawayo, Zimbabwe','bb');
    -
  • - -
  • Fix some other random outliers:

    - +
    dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.1016/j.aquaculture.2015.09.003' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'http:/dx.doi.org/10.1016/j.aquaculture.2015.09.003';
     dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.5337/2016.200' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'doi: https://dx.doi.org/10.5337/2016.200';
     dspace=# update metadatavalue set text_value = 'https://dx.doi.org/doi:10.1371/journal.pone.0062898' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'Http://dx.doi.org/doi:10.1371/journal.pone.0062898';
     dspace=# update metadatavalue set text_value = 'https://dx.doi.10.1016/j.cosust.2013.11.012' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'http:dx.doi.10.1016/j.cosust.2013.11.012';
     dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.1080/03632415.2014.883570' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'org/10.1080/03632415.2014.883570';
     dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.15446/agron.colomb.v32n3.46052' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'Doi: 10.15446/agron.colomb.v32n3.46052';
    -
  • - -
  • And do another round of http:// → https:// cleanups:

    - -
    dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://dx.doi.org', 'https://dx.doi.org') where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'http://dx.doi.org%';
    -
  • - -
  • Run all DOI corrections on CGSpace

  • - -
  • Something to think about here is to write a Curation Task in Java to do these sanity checks / corrections every night

  • - -
  • Then we could add a cron job for them and run them from the command line like:

    - -
    [dspace]/bin/dspace curate -t noop -i 10568/79891
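
  • A hypothetical crontab entry for that kind of nightly task (the task name here is made up; noop above is just DSpace's no-op task) could look like:

    # hypothetical: run a custom curation task over the whole site every night
    0 3 * * * [dspace]/bin/dspace curate -t fixdois -i all > /dev/null 2>&1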

    2017-02-20

    $ python
     Python 3.6.0 (default, Dec 25 2016, 17:30:53)
     >>> print('Entwicklung & Ländlicher Raum')
     Entwicklung & Ländlicher Raum
     >>> print('Entwicklung & Ländlicher Raum'.encode())
     b'Entwicklung & L\xc3\xa4ndlicher Raum'
    -
    - -
  • So for now I will remove the encode call from the script (though it was never used on the versions on the Linux hosts), leading me to believe it really was a temporary problem, perhaps due to macOS or the Python build I was using.

  • + - -

    2017-02-21

    - +

    2017-02-21

    $ [dspace]/bin/dspace filter-media -f -i 10568/16856 -p "ImageMagick PDF Thumbnail"
     File: earlywinproposal_esa_postharvest.pdf.jpg
     FILTERED: bitstream 13787 (item: 10568/16881) and created 'earlywinproposal_esa_postharvest.pdf.jpg'
     File: postHarvest.jpg.jpg
     FILTERED: bitstream 16524 (item: 10568/24655) and created 'postHarvest.jpg.jpg'
    -
    - -
  • According to dspace.cfg the ImageMagick PDF Thumbnail plugin should only process PDFs:

    - +
    filter.org.dspace.app.mediafilter.ImageMagickImageThumbnailFilter.inputFormats = BMP, GIF, image/png, JPG, TIFF, JPEG, JPEG 2000
     filter.org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter.inputFormats = Adobe PDF
    -
  • - -
  • I’ve sent a message to the mailing list and might file a Jira issue

  • - -
  • Ask Atmire about the failed interpolation of the dspace.internalUrl variable in atmire-cua.cfg

  • + - -

    2017-02-22

    - +

    2017-02-22

    - -

    2017-02-26

    - +

    2017-02-26

    dspace=# select distinct metadata_field_id from metadatavalue where resource_type_id=2 and text_value like 'http://hdl.handle.net%';
     dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://hdl.handle.net', 'https://hdl.handle.net') where resource_type_id=2 and metadata_field_id IN (25, 113, 179, 219, 220, 223) and text_value like 'http://hdl.handle.net%';
     UPDATE 58633
    -
    - -
  • This works but I’m thinking I’ll wait on the replacement as there are perhaps some other places that rely on http://hdl.handle.net (grep the code, it’s scary how many things are hard coded)
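
  • For example (a sketch; the checkout path is assumed), a quick count of hard-coded handle URLs per file in the source tree:

    $ grep -rI -c 'http://hdl.handle.net' ~/src/git/DSpace | grep -v ':0$'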

  • - -
  • Send message to dspace-tech mailing list with concerns about this

  • + - -

    2017-02-27

    - +

    2017-02-27

    $ openssl s_client -connect svcgroot2.cgiarad.org:3269
     CONNECTED(00000003)
     depth=0 CN = SVCGROOT2.CGIARAD.ORG
    @@ -396,15 +330,13 @@ verify error:num=21:unable to verify the first certificate
     verify return:1
     ---
     Certificate chain
    -0 s:/CN=SVCGROOT2.CGIARAD.ORG
    -i:/CN=CGIARAD-RDWA-CA
    + 0 s:/CN=SVCGROOT2.CGIARAD.ORG
    +   i:/CN=CGIARAD-RDWA-CA
     ---
    -
    - -
  • For some reason it is now signed by a private certificate authority

  • - -
  • This error seems to have started on 2017-02-25:

    - +
    $ grep -c "unable to find valid certification path" [dspace]/log/dspace.log.2017-02-*
     [dspace]/log/dspace.log.2017-02-01:0
     [dspace]/log/dspace.log.2017-02-02:0
    @@ -433,52 +365,37 @@ i:/CN=CGIARAD-RDWA-CA
     [dspace]/log/dspace.log.2017-02-25:7
     [dspace]/log/dspace.log.2017-02-26:8
     [dspace]/log/dspace.log.2017-02-27:90
    -
  • - -
  • Also, it seems that we need to use a different user for LDAP binds, as we’re still using the temporary one from the root migration, so maybe we can go back to the previous user we were using

  • - -
  • So it looks like the certificate is invalid AND the bind users we had been using were deleted

  • - -
  • Biruk Debebe recreated the bind user and now we are just waiting for CGNET to update their certificates

  • - -
  • Regarding the filter-media issue I found earlier, it seems that the ImageMagick PDF plugin will also process JPGs if they are in the “Content Files” (aka ORIGINAL) bundle

  • - -
  • The problem likely lies in the logic of ImageMagickThumbnailFilter.java, as ImageMagickPdfThumbnailFilter.java extends it

  • - -
  • Run CIAT corrections on CGSpace

    - -
    dspace=# update metadatavalue set authority='3026b1de-9302-4f3e-85ab-ef48da024eb2', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value = 'International Center for Tropical Agriculture';
    -
  • - -
  • CGNET has fixed the certificate chain on their LDAP server

  • - -
  • Redeploy CGSpace and DSpace Test on the latest 5_x-prod branch with fixes for the LDAP bind user

  • - -
  • Run all system updates on CGSpace server and reboot

  • + - -

    2017-02-28

    - +
    dspace=# update metadatavalue set authority='3026b1de-9302-4f3e-85ab-ef48da024eb2', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value = 'International Center for Tropical Agriculture';
    +
    +

    2017-02-28

    dspace=# \copy (select resource_id, metadata_value_id from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='International Center for Tropical Agriculture') to /tmp/ciat.csv with csv;
     COPY 1968
    -
    - -
  • And then use awk to print the duplicate lines to a separate file:

    - -
    $ awk -F',' 'seen[$1]++' /tmp/ciat.csv > /tmp/ciat-dupes.csv
    -
  • - -
  • From that file I can create a list of 279 deletes and put them in a batch script like:

    - -
    delete from metadatavalue where resource_type_id=2 and metadata_field_id=3 and metadata_value_id=2742061;
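
  • One hypothetical way to generate that batch directly from the dupes file (the second CSV column is metadata_value_id, per the \copy above):

    $ awk -F',' '{print "delete from metadatavalue where resource_type_id=2 and metadata_field_id=3 and metadata_value_id=" $2 ";"}' /tmp/ciat-dupes.csv > /tmp/ciat-deletes.sql
    $ psql -U dspace -d dspace -f /tmp/ciat-deletes.sql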
    diff --git a/docs/2017-03/index.html b/docs/2017-03/index.html index 7a3c0ad08..59a3b232d 100644 --- a/docs/2017-03/index.html +++ b/docs/2017-03/index.html @@ -8,13 +8,10 @@ @@ -39,13 +34,10 @@ $ identify ~/Desktop/alc_contrastes_desafios.jpg - + @@ -142,14 +132,11 @@ $ identify ~/Desktop/alc_contrastes_desafios.jpg

    -

    2017-03-01

    - +

    2017-03-01

    - -

    2017-03-02

    - +

    2017-03-02

    $ identify ~/Desktop/alc_contrastes_desafios.jpg
     /Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
    -
    - - - - -

    2017-03-04

    - +

    2017-03-04

    - -

    2017-03-05

    - +

    2017-03-05

    $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "LAND REFORM", "language": null}' | json_pp
    -
    - -
  • But there are hundreds of combinations of fields and values (like dc.subject and all the center subjects), and we can’t use wildcards in REST!
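
  • Purely to illustrate what brute-forcing that would look like (the subject values here are placeholders):

    $ for subject in "LAND REFORM" "LIVESTOCK" "GENDER"; do curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d "{\"key\": \"cg.subject.ilri\",\"value\": \"$subject\", \"language\": null}" | json_pp; done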

  • - -
  • Reading about enabling multiple handle prefixes in DSpace

  • - -
  • There is a mailing list thread from 2011 about it: http://dspace.2283337.n4.nabble.com/Multiple-handle-prefixes-merged-DSpace-instances-td3427192.html

  • - -
  • And a comment from Atmire’s Bram about it on the DSpace wiki: https://wiki.duraspace.org/display/DSDOC5x/Installing+DSpace?focusedCommentId=78163296#comment-78163296

  • - -
  • Bram mentions an undocumented configuration option handle.plugin.checknameauthority, but I noticed another one in dspace.cfg:

    - +
    # List any additional prefixes that need to be managed by this handle server
     # (as for examle handle prefix coming from old dspace repository merged in
     # that repository)
     # handle.additional.prefixes = prefix1[, prefix2]
    -
  • - -
  • Because of this I noticed that our Handle server’s config.dct was potentially misconfigured!

  • - -
  • We had some default values still present:

    - +
    "300:0.NA/YOUR_NAMING_AUTHORITY"
    -
  • - -
  • I’ve changed them to the following and restarted the handle server:

    - +
    "300:0.NA/10568"
    -
  • - -
  • In looking at all the configs I just noticed that we are not providing a DOI in the Google-specific metadata crosswalk

  • - -
  • From dspace/config/crosswalks/google-metadata.properties:

    - +
    google.citation_doi = cg.identifier.doi
    -
  • - -
  • This works, and makes DSpace output the following metadata on the item view page:

    - +
    <meta content="https://dx.doi.org/10.1186/s13059-017-1153-y" name="citation_doi">
    -
  • - -
  • Submitted and merged pull request for this: https://github.com/ilri/DSpace/pull/305

  • - -
  • Submit pull request to set the author separator for XMLUI item lists to a semicolon instead of “,”: https://github.com/ilri/DSpace/pull/306

  • - -
  • I want to show it briefly to Abenet and Peter to get feedback

  • + - -

    2017-03-06

    - +

    2017-03-06

    - -

    2017-03-07

    - +

    2017-03-07

    +
  • I need to talk to Michael and Peter to share the news, and discuss the structure of their community(s) and try some actual test data
  • We'll need to do some data cleaning to make sure they are using the same fields we are, like dc.type and cg.identifier.status
  • Another thing is that the import process creates new dc.date.accessioned and dc.date.available fields, so we end up with duplicates (is it important to preserve the originals for these?)
  • Report DS-3520 issue to Atmire
  • - -

    2017-03-08

    - +

    2017-03-08

    - -

    2017-03-09

    - +

    2017-03-09

    dspace=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'description' and qualifier = 'sponsorship') group by text_value order by count desc) to /tmp/sponsorship.csv with csv;
     COPY 285
    -
    - - -

    2017-03-12

    - +

    2017-03-12

    $ ./fix-metadata-values.py -i Investors-Fix-51.csv -f dc.description.sponsorship -t Action -m 29 -d dspace -u dspace -p fuuuu
     $ ./delete-metadata-values.py -i Investors-Delete-121.csv -f dc.description.sponsorship -m 29 -d dspace -u dspace -p fuuu
    -
    - -
  • Generate a new list of unique sponsors so we can update the controlled vocabulary:

    - -
    dspace=# \copy (select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'description' and qualifier = 'sponsorship')) to /tmp/sponsorship.csv with csv;
    -
  • - -
  • Pull request for controlled vocabulary if Peter approves: https://github.com/ilri/DSpace/pull/308

  • - -
  • Review Sisay’s roots, tubers, and bananas (RTB) theme, which still needs some fixes to work properly: https://github.com/ilri/DSpace/pull/307

  • - -
  • Created an issue to track the progress on the Livestock CRP theme: https://github.com/ilri/DSpace/issues/309

  • - -
  • Created a basic theme for the Livestock CRP community

  • + - -

    Livestock CRP theme

    - -

    2017-03-15

    - +
    dspace=# \copy (select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'description' and qualifier = 'sponsorship')) to /tmp/sponsorship.csv with csv;
    +
    +

    Livestock CRP theme

    +

    2017-03-15

    - -

    2017-03-16

    - +

    2017-03-16

    - -

    2017-03-20

    - +

    2017-03-20

    - -

    2017-03-24

    - +

    2017-03-24

    - -

    2017-03-28

    - +

    2017-03-28

    $ ./fix-metadata-values.py -i ccafs-flagships-feb7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p fuuu
    -
    - -
  • We’ve been waiting since February to run these

  • - -
  • Also, I generated a list of all CCAFS flagships because there are a dozen or so more than there should be:

    - +
    dspace=# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=210 group by text_value order by count desc) to /tmp/ccafs.csv with csv;
    -
  • - -
  • I sent a list to CCAFS people so they can tell me if some should be deleted or moved, etc

  • - -
  • Test, squash, and merge Sisay’s RTB theme into 5_x-prod: https://github.com/ilri/DSpace/pull/316

  • + - -

    2017-03-29

    - +

    2017-03-29

    - -

    2017-03-30

    - +
    dspace=# select case when metadata_schema_id=1 then 'dc' else 'cg' end as schema, element, qualifier, scope_note from metadatafieldregistry where metadata_schema_id in (1, 2);
    +
    +
    dspace=# select coalesce(case when metadata_schema_id=1 then 'dc.' else 'cg.' end) || concat_ws('.', element, qualifier) as field, scope_note from metadatafieldregistry where metadata_schema_id in (1, 2);
    +

    2017-03-30

    +
    dspace=# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=227;
    +

    2017-04-04

    $ grep -c profile /tmp/filter-media-cmyk.txt
     1584
    -
    - -
  • Trying to find a way to get the number of items submitted by a certain user in 2016

  • - -
  • It's not possible in the DSpace search / module interfaces, but it might be possible to derive it from dc.description.provenance, as that field contains the name and email of the submitter/approver, i.e.:

    - +
    Submitted by Francesca Giampieri (fgiampieri) on 2016-01-19T13:56:43Z^M
     No. of bitstreams: 1^M
     ILAC_Brief21_PMCA.pdf: 113462 bytes, checksum: 249fef468f401c066a119f5db687add0 (MD5)
    -
  • - -
  • This SQL query returns fields that were submitted or approved by giampieri in 2016 and contain a “checksum” (ie, there was a bitstream in the submission):

    - -
    dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^(Submitted|Approved).*giampieri.*2016-.*checksum.*';
    -
  • - -
  • Then this one does the same, but for fields that don’t contain checksums (ie, there was no bitstream in the submission):

    - -
    dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^(Submitted|Approved).*giampieri.*2016-.*' and text_value !~ '^(Submitted|Approved).*giampieri.*2016-.*checksum.*';
    -
  • - -
  • For some reason there seem to be way too many fields, for example there are 498 + 13 here, which is 511 items for just this one user.

  • - -
  • It looks like there can be a scenario where the user submitted AND approved it, so some records might be doubled…

  • - -
  • In that case it might just be better to see how many the user submitted (both with and without bitstreams):

    - -
    dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^Submitted.*giampieri.*2016-.*';
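
  • A slight variation (sketch) that returns just the number instead of the full rows:

    $ psql -U dspace -d dspace -c "select count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^Submitted.*giampieri.*2016-.*';"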
    -
  • + - -

    2017-04-05

    - +
    dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^(Submitted|Approved).*giampieri.*2016-.*checksum.*';
    +
    +
    dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^(Submitted|Approved).*giampieri.*2016-.*' and text_value !~ '^(Submitted|Approved).*giampieri.*2016-.*checksum.*';
    +
    +
    dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^Submitted.*giampieri.*2016-.*';
    +

    2017-04-05

    $ grep -c profile /tmp/filter-media-cmyk.txt
     2505
    -
    - - -

    2017-04-06

    - +

    2017-04-06

    +
  • I need to look at the Munin graphs after a few days to see if the load has changed
  • Run system updates on DSpace Test and reboot the server
  • Discussing harvesting CIFOR's DSpace via OAI
  • Sisay added their OAI as a source to a new collection, but using the Simple Dublin Core method, so many fields are unqualified and duplicated
  • Looking at the documentation it seems that we probably want to be using DSpace Intermediate Metadata
  • - -

    2017-04-10

    - +

    2017-04-10

    +
  • Remove James from Linode access
  • Look into having CIFOR use a sub prefix of 10568 like 10568.01
  • Handle.net calls this “derived prefixes” and it seems this would work with DSpace if we wanted to go that route
  • CIFOR is starting to test aligning their metadata more with CGSpace/CG core
  • They shared a test item which is using cg.coverage.country, cg.subject.cifor, dc.subject, and dc.date.issued
  • Looking at their OAI I'm not sure it has updated as I don't see the new fields: https://data.cifor.org/dspace/oai/request?verb=ListRecords&resumptionToken=oai_dc///col_11463_6/900
  • Maybe they need to make sure they are running the OAI cache refresh cron job, or maybe OAI doesn't export these?
  • I added cg.subject.cifor to the metadata registry and I'm waiting for the harvester to re-harvest to see if it picks up more data now
  • Another possibility is that we could use a crosswalk… but I've never done it.
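
  • For reference, a hypothetical cron entry for such an OAI refresh (schedule and path are assumptions) would just run the oai import regularly:

    # hypothetical: refresh the OAI index nightly
    0 4 * * * [dspace]/bin/dspace oai import > /dev/null 2>&1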
  • - -

    2017-04-11

    - +

    2017-04-11

    - -

    2017-04-12

    - +

    2017-04-12

    + +
  • Looking at one of CGSpace's items in OAI it doesn't seem that metadata fields other than those in the DC schema are exported:
  • Side note: WTF, I just saw an item on CGSpace's OAI that is using dc.cplace.country and dc.rplace.region, which we stopped using in 2016 after the metadata migrations:

    stale metadata in OAI
  • I don't see these fields anywhere in our source code or the database's metadata registry, so maybe it's just a cache issue
  • I will have to check the OAI cron scripts on DSpace Test, and then run them on CGSpace
  • - -
  • Running dspace oai import and dspace oai clean-cache have zero effect, but this seems to rebuild the cache from scratch:

    $ /home/dspacetest.cgiar.org/bin/dspace oai import -c
     ...
     63900 items imported so far...
    @@ -308,16 +259,12 @@ ILAC_Brief21_PMCA.pdf: 113462 bytes, checksum: 249fef468f401c066a119f5db687add0
     Total: 64056 items
     Purging cached OAI responses.
     OAI 2.0 manager action ended. It took 829 seconds.
    -
    - -
  • After reading some threads on the DSpace mailing list, I see that clean-cache is actually only for caching responses, ie to client requests in the OAI web application

  • - -
  • These are stored in [dspace]/var/oai/requests/

  • - -
  • The import command should theoretically catch situations like this where an item’s metadata was updated, but in this case we changed the metadata schema and it doesn’t seem to catch it (could be a bug!)

  • - -
  • Attempting a full rebuild of OAI on CGSpace:

    - +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
     $ time schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace oai import -c
     ...
    @@ -329,225 +276,169 @@ OAI 2.0 manager action ended. It took 1032 seconds.
     real    17m20.156s
     user    4m35.293s
     sys     1m29.310s
    -
  • - -
  • Now the data for 10568/6 is correct in OAI: https://cgspace.cgiar.org/oai/request?verb=GetRecord&metadataPrefix=dim&identifier=oai:cgspace.cgiar.org:10568/6

  • - -
  • Perhaps I need to file a bug for this, or at least ask on the dspace-tech mailing list?

  • - -
  • I wonder if we could use a crosswalk to convert to a format that CG Core wants, like <date Type="Available">

  • + - -

    2017-04-13

    - +

    2017-04-13

    -

    Last Harvest Result: OAI server did not contain any updates on 2017-04-13 02:19:47.964

    - - -

    2017-04-14

    - +

    2017-04-14

    +
  • Reboot DSpace Test server to get new Linode kernel
  • - -

    2017-04-17

    - +

    2017-04-17

    - -

    2017-04-18

    - +
    Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
    +  Detail: Key (bitstream_id)=(435) is still referenced from table "bundle".
    +

    2017-04-18

    $ git clone https://github.com/ilri/ckm-cgspace-rest-api.git
     $ cd ckm-cgspace-rest-api/app
     $ gem install bundler
     $ bundle
     $ cd ..
     $ rails -s
    -
    - -
  • I used Ansible to create a PostgreSQL user that only has SELECT privileges on the tables it needs:

    - -
    $ ansible linode02 -u aorth -b --become-user=postgres -K -m postgresql_user -a 'db=database name=username password=password priv=CONNECT/item:SELECT/metadatavalue:SELECT/metadatafieldregistry:SELECT/metadataschemaregistry:SELECT/collection:SELECT/handle:SELECT/bundle2bitstream:SELECT/bitstream:SELECT/bundle:SELECT/item2bundle:SELECT state=present'
    -
  • - -
  • Need to look into running this via systemd

  • - -
  • This is interesting for creating runnable commands from bundle:

    - -
    $ bundle binstubs puma --path ./sbin
    -
  • + - -

    2017-04-19

    - -

    2017-04-20

    - +
    value.replace(" ||","||").replace("|| ","||").replace(" || ","||")
    +
    +
    unescape(value,"url")
    +
    +
    value.split('/')[-1].replace(/#.*$/,"")
    +
    +

    2017-04-20

    - -

    Flagging and filtering duplicates in OpenRefine

    - +
    value.replace(/\|\|$/,"")
    +
    +

    Flagging and filtering duplicates in OpenRefine

    COLLETOTRICHUM LINDEMUTHIANUM||                  FUSARIUM||GERMPLASM
    -
    - -
  • Add a description to the file names using:

    - +
    value + "__description:" + cells["dc.type"].value
    -
  • - -
  • Test import of 933 records:

    - +
    $ [dspace]/bin/dspace import -a -e aorth@mjanja.ch -c 10568/87193 -s /home/aorth/src/CIAT-Books/SimpleArchiveFormat/ -m /tmp/ciat
     $ wc -l /tmp/ciat
     933 /tmp/ciat
    -

  • Run system updates on CGSpace and reboot server
  • This includes switching nginx to using upstream with keepalive instead of direct proxy_pass (see the sketch after this list)
  • Re-deploy CGSpace to latest 5_x-prod, including the PABRA and RTB XMLUI themes, as well as the PDF processing and CMYK changes
  • More work on Ansible infrastructure stuff for Tsega’s CKM DSpace REST API
  • I’m going to start re-processing all the PDF thumbnails on CGSpace, one community at a time:

    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
    $ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace filter-media -f -v -i 10568/71249 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt
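  • For reference, the upstream-with-keepalive pattern mentioned above looks roughly like this in nginx (the backend port and keepalive count are assumptions, not the actual CGSpace values):

    upstream tomcat_http {
        server 127.0.0.1:8080;
        keepalive 32;
    }

    server {
        location / {
            # keepalive to an upstream requires HTTP/1.1 and a cleared Connection header
            proxy_http_version 1.1;
            proxy_set_header Connection "";
            proxy_pass http://tomcat_http;
        }
    }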

    2017-04-22

    dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (435, 1136, 1132, 1220, 1236, 3002, 3255, 5322);

    2017-04-24
    2017-04-24 00:00:15,578 INFO  com.atmire.dspace.discovery.AtmireSolrService @ Processing (55 of 58853): 70590
     2017-04-24 00:00:15,586 INFO  com.atmire.dspace.discovery.AtmireSolrService @ Processing (56 of 58853): 74507
     2017-04-24 00:00:15,614 ERROR com.atmire.dspace.discovery.AtmireSolrService @ this IndexWriter is closed
     org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: this IndexWriter is closed
    at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
    at org.apache.solr.client.solrj.SolrServer.deleteByQuery(SolrServer.java:285)
    at org.apache.solr.client.solrj.SolrServer.deleteByQuery(SolrServer.java:271)
    at org.dspace.discovery.SolrServiceImpl.unIndexContent(SolrServiceImpl.java:331)
    at org.dspace.discovery.SolrServiceImpl.unIndexContent(SolrServiceImpl.java:315)
    at com.atmire.dspace.discovery.AtmireSolrService.indexContent(AtmireSolrService.java:803)
    at com.atmire.dspace.discovery.AtmireSolrService.updateIndex(AtmireSolrService.java:876)
    at org.dspace.discovery.IndexClient.main(IndexClient.java:127)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
    at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)

  • Looking at the past few days of logs, it looks like the indexing process started crashing on 2017-04-20:
    # grep -c 'IndexWriter is closed' [dspace]/log/dspace.log.2017-04-*
     [dspace]/log/dspace.log.2017-04-01:0
     [dspace]/log/dspace.log.2017-04-02:0
    @@ -573,36 +464,28 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: this Index
     [dspace]/log/dspace.log.2017-04-22:13278
     [dspace]/log/dspace.log.2017-04-23:22720
     [dspace]/log/dspace.log.2017-04-24:21422

  • I restarted Tomcat and re-ran the discovery process manually:

    [dspace]/bin/dspace index-discovery

  • Now everything is ok
  • Finally finished manually running the cleanup task over and over and null’ing the conflicting IDs:

    dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (435, 1132, 1136, 1220, 1236, 3002, 3255, 5322, 5098, 5982, 5897, 6245, 6184, 4927, 6070, 4925, 6888, 7368, 7136, 7294, 7698, 7864, 10799, 10839, 11765, 13241, 13634, 13642, 14127, 14146, 15582, 16116, 16254, 17136, 17486, 17824, 18098, 22091, 22149, 22206, 22449, 22548, 22559, 22454, 22253, 22553, 22897, 22941, 30262, 33657, 39796, 46943, 56561, 58237, 58739, 58734, 62020, 62535, 64149, 64672, 66988, 66919, 76005, 79780, 78545, 81078, 83620, 84492, 92513, 93915);

  • Now running the cleanup script on DSpace Test and already seeing 11GB freed from the assetstore—it’s likely we haven’t had a cleanup task complete successfully in years…

    2017-04-25
    # find [dspace]/assetstore/ -type f | wc -l
     113104

  • Troubleshooting the Atmire Solr update process that runs at 3:00 AM every morning, after finishing at 100% it has this error:
    [=================================================> ]99% time remaining: 0 seconds. timestamp: 2017-04-25 09:07:12
     [=================================================> ]99% time remaining: 0 seconds. timestamp: 2017-04-25 09:07:12
     [=================================================> ]99% time remaining: 0 seconds. timestamp: 2017-04-25 09:07:12
    @@ -653,36 +536,26 @@ Caused by: java.lang.ClassNotFoundException: org.dspace.statistics.content.DSpac
     	at java.lang.Class.forName(Class.java:264)
     	at com.atmire.statistics.statlet.XmlParser.parsedatasetGenerator(SourceFile:299)
     	at com.atmire.statistics.display.StatisticsGraph.parseDatasetGenerators(SourceFile:250)

  • Run system updates on DSpace Test and reboot the server (new Java 8 131)
  • Run the SQL cleanups on the bundle table on CGSpace and run the [dspace]/bin/dspace cleanup task
  • I will be interested to see the file count in the assetstore as well as the database size after the next backup (last backup size is 111M)
  • Final file count after the cleanup task finished: 77843
  • So that is 30,000 files, and about 7GB
  • Add logging to the cleanup cron task

    2017-04-26

    $ gpg --keyserver hkp://keys.gnupg.net --recv-keys 409B6B1796C275462A1703113804BB82D39DC0E3
    $ \curl -sSL https://raw.githubusercontent.com/wayneeseguin/rvm/master/binscripts/rvm-installer | bash -s stable --ruby
    ... reload shell to get new Ruby
    $ gem install sass -v 3.3.14
    $ gem install compass -v 1.0.3

  • Help Tsega re-deploy the ckm-cgspace-rest-api on DSpace Test
  • + diff --git a/docs/2017-05/index.html b/docs/2017-05/index.html index af4e169bd..ead480916 100644 --- a/docs/2017-05/index.html +++ b/docs/2017-05/index.html @@ -6,7 +6,7 @@ - + @@ -14,8 +14,8 @@ - - + + @@ -96,141 +96,104 @@


    2017-05-01

    2017-05-02

    2017-05-04

    2017-05-05

    $ [dspace]/bin/dspace curate -t requiredmetadata -i 10568/1 -r - > /tmp/curation.out

    2017-05-06

    2017-05-07

    $ ./fix-metadata-values.py -i ccafs-flagships-may7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p fuuu

    2017-05-08

    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m -XX:-UseGCOverheadLimit"
     $ [dspace]/bin/dspace packager -s -o ignoreHandle=false -t AIP -e some@user.com -p 10568/87775 /home/aorth/10947-1/10947-1.zip
     $ for collection in /home/aorth/10947-1/COLLECTION@10947-*; do [dspace]/bin/dspace packager -s -o ignoreHandle=false -t AIP -e some@user.com -p 10947/1 $collection; done
     $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager -r -f -u -t AIP -e some@user.com $item; done

  • Note that in submission mode DSpace ignores the handle specified in mets.xml in the zip file, so you need to turn that off with -o ignoreHandle=false
  • The -u option suppresses prompts, to allow the process to run without user input
  • Give feedback to CIFOR about their data quality:
  • Help Marianne from WLE with an Open Search query to show the latest WLE CRP outputs: https://cgspace.cgiar.org/open-search/discover?query=crpsubject:WATER%2C+LAND+AND+ECOSYSTEMS&sort_by=2&order=DESC
  • This uses the webui's item list sort options, see webui.itemlist.sort-option in dspace.cfg (see the sketch after this list)
  • The equivalent Discovery search would be: https://cgspace.cgiar.org/discover?filtertype_1=crpsubject&filter_relational_operator_1=equals&filter_1=WATER%2C+LAND+AND+ECOSYSTEMS&submit_apply_filter=&query=&rpp=10&sort_by=dc.date.issued_dt&order=desc
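  • As a sketch, the stock webui.itemlist.sort-option entries in dspace.cfg look like the following, so sort_by=2 in the Open Search URL above refers to the second configured option (date issued); the exact numbering on CGSpace may differ:

    webui.itemlist.sort-option.1 = title:dc.title:title
    webui.itemlist.sort-option.2 = dateissued:dc.date.issued:date
    webui.itemlist.sort-option.3 = dateaccessioned:dc.date.accessioned:date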

    2017-05-09


    dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';

    Caused by: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint "handle_pkey"
      Detail: Key (handle_id)=(80928) already exists.

    2017-05-10

    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx3072m -XX:-UseGCOverheadLimit"
     $ [dspace]/bin/dspace packager -r -a -t AIP -o skipIfParentMissing=true -e some@user.com -p 10568/80923 /home/aorth/10947-2517/10947-2517.zip
     $ [dspace]/bin/dspace packager -r -a -t AIP -o skipIfParentMissing=true -e some@user.com -p 10568/80923 /home/aorth/10947-2515/10947-2515.zip
    @@ -238,119 +201,95 @@ $ [dspace]/bin/dspace packager -r -a -t AIP -o skipIfParentMissing=true -e some@
     $ [dspace]/bin/dspace packager -s -t AIP -o ignoreHandle=false -e some@user.com -p 10568/80923 /home/aorth/10947-1/10947-1.zip
     $ for collection in /home/aorth/10947-1/COLLECTION@10947-*; do [dspace]/bin/dspace packager -s -o ignoreHandle=false -t AIP -e some@user.com -p 10947/1 $collection; done
     $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager -r -f -u -t AIP -e some@user.com $item; done

  • Basically, import the smaller communities using recursive AIP import (with skipIfParentMissing)
  • Then, for the larger collection, create the community, collections, and items separately, ingesting the items one by one
  • The -XX:-UseGCOverheadLimit JVM option helps with some issues in large imports
  • After this I ran the update-sequences.sql script (with Tomcat shut down), and cleaned up the 200+ blank metadata records:

    dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';

    2017-05-13


    2017-05-15

    $ ./fix-metadata-values.py -i /tmp/ccafs-flagships-may7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p 'fuuu'

  • These include:

  • Re-deploy CGSpace and DSpace Test and run system updates
  • Reboot DSpace Test
  • Fix cron jobs for log management on DSpace Test, as they weren't catching dspace.log.* files correctly and we had over six months of them and they were taking up many gigs of disk space

    2017-05-16


    2017-05-17

    ERROR: duplicate key value violates unique constraint "handle_pkey" Detail: Key (handle_id)=(84834) already exists.

    dspace=# select * from handle where handle_id=84834;
     handle_id |   handle   | resource_type_id | resource_id
    -----------+------------+------------------+-------------
         84834 | 10947/1332 |                2 |       87113

    dspace=# select * from handle where handle_id=(select max(handle_id) from handle);
     handle_id |  handle  | resource_type_id | resource_id
    -----------+----------+------------------+-------------
         86873 | 10947/99 |                2 |       89153
    (1 row)

    dspace=# select setval('handle_seq',86873);

    2017-05-21


    2017-05-22

    $ grep 10947/ /tmp/collections | grep -v cocoon | awk -F/ '{print $3"/"$4}' | awk -F\" '{print $1}' | vim -

  • Then I joined them together and ran this old SQL query from the dspace-tech mailing list which gives you authors for items in those collections:
    dspace=# select distinct text_value
     from metadatavalue
     where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author')
    @@ -364,82 +303,62 @@ AND resource_id IN (select item_id from collection2item where collection_id IN (
     47/2519', '10947/2708', '10947/2526', '10947/2871', '10947/2527', '10947/4467', '10947/3457', '10947/2528', '10947/2529', '10947/2533', '10947/2530', '10947/2
     531', '10947/2532', '10947/2538', '10947/2534', '10947/2540', '10947/2900', '10947/2539', '10947/2784', '10947/2536', '10947/2805', '10947/2541', '10947/2535'
     , '10947/2537', '10568/93761')));

  • To get a CSV (with counts) from that:
    dspace=# \copy (select distinct text_value, count(*)
     from metadatavalue
     where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author')
     AND resource_type_id = 2
     AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10947/2', '10947/3', '10947/10', '10947/4', '10947/5', '10947/6', '10947/7', '10947/8', '10947/9', '10947/11', '10947/25', '10947/12', '10947/26', '10947/27', '10947/28', '10947/29', '10947/30', '10947/13', '10947/14', '10947/15', '10947/16', '10947/31', '10947/32', '10947/33', '10947/34', '10947/35', '10947/36', '10947/37', '10947/17', '10947/18', '10947/38', '10947/19', '10947/39', '10947/40', '10947/41', '10947/42', '10947/43', '10947/2512', '10947/44', '10947/20', '10947/21', '10947/45', '10947/46', '10947/47', '10947/48', '10947/49', '10947/22', '10947/23', '10947/24', '10947/50', '10947/51', '10947/2518', '10947/2776', '10947/2790', '10947/2521', '10947/2522', '10947/2782', '10947/2525', '10947/2836', '10947/2524', '10947/2878', '10947/2520', '10947/2523', '10947/2786', '10947/2631', '10947/2589', '10947/2519', '10947/2708', '10947/2526', '10947/2871', '10947/2527', '10947/4467', '10947/3457', '10947/2528', '10947/2529', '10947/2533', '10947/2530', '10947/2531', '10947/2532', '10947/2538', '10947/2534', '10947/2540', '10947/2900', '10947/2539', '10947/2784', '10947/2536', '10947/2805', '10947/2541', '10947/2535', '10947/2537', '10568/93761'))) group by text_value order by count desc) to /tmp/cgiar-librar-authors.csv with csv;

    2017-05-23

    dspace=# \copy (select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id=119) to /tmp/wle.csv with csv;
     COPY 111

  • Respond to Atmire message about ORCIDs, saying that right now we’d prefer to just have them available via REST API like any other metadata field, and that I’m available for a Skype

    2017-05-26

    2017-05-28

    dspace=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like 'Omore, A%';

  • Set the authority for all variations to one containing an ORCID:

    dspace=# update metadatavalue set authority='4428ee88-90ef-4107-b837-3c0ec988520b', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Omore, A%';
    UPDATE 187

  • Next I need to do Edgar Twine:

    dspace=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like 'Twine, E%';

  • But it doesn’t look like any of his existing entries are linked to an authority which has an ORCID, so I edited the metadata via “Edit this Item” and looked up his ORCID and linked it there
  • Now I should be able to set his name variations to the new authority:

    dspace=# update metadatavalue set authority='f70d0a01-d562-45b8-bca3-9cf7f249bc8b', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Twine, E%';

  • Run the corrections on CGSpace and then update discovery / authority
  • I notice that there are a handful of java.lang.OutOfMemoryError: Java heap space errors in the Catalina logs on CGSpace, I should go look into that…

    2017-05-29

    diff --git a/docs/2017-06/index.html b/docs/2017-06/index.html index 043e70385..91ab69d36 100644 --- a/docs/2017-06/index.html +++ b/docs/2017-06/index.html @@ -6,7 +6,7 @@ - + @@ -14,8 +14,8 @@ - - + + @@ -96,83 +96,69 @@

    - - -

    2017-06-01

    2017-06-04

    - -

    2017-06-05

  • Finally, after some filtering to see which small outliers there were (based on dc.format.extent using “p. 1-14” vs “29 p.”), create a new column with last page number:
  • Then create a new, unique file name to be used in the output, based on a SHA1 of the dc.title and with a description:
  • Start processing 769 records after filtering the following (there are another 159 records that have some other format, or for example they have their own PDF which I will process later), using a modified generate-thumbnails.py script to read certain fields and then pass to GhostScript:
  • 17 of the items have issues with incorrect page number ranges, and upon closer inspection they do not appear in the referenced PDF
  • I’ve flagged them and proceeded without them (752 total) on DSpace Test:

    $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/93843 --source /home/aorth/src/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books.map &> /tmp/ciat-books.log

  • I went and did some basic sanity checks on the remaining items in the CIAT Book Chapters and decided they are mostly fine (except one duplicate and the flagged ones), so I imported them to DSpace Test too (162 items)
  • Total items in CIAT Book Chapters is 914, with the others being flagged for some reason, and we should send that back to CIAT
  • Restart Tomcat on CGSpace so that the cg.identifier.wletheme field is available on REST API for Macaroni Bros

    2017-06-07

    +----------------+----------------------------+---------------------+---------+
     | Version        | Description                | Installed on        | State   |
     +----------------+----------------------------+---------------------+---------+
    @@ -197,85 +183,64 @@
     | 5.5.2015.12.03 | Atmire MQM migration       | 2016-11-27 06:39:06 | OutOrde |
     | 5.6.2016.08.08 | CUA emailreport migration  | 2017-01-29 11:18:56 | OutOrde |
     +----------------+----------------------------+---------------------+---------+
    -
    - -
  • Merge the pull request for WLE Phase II themes

    2017-06-18

    2017-06-20

  • Finally import 914 CIAT Book Chapters to CGSpace in two batches:
    $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/35701 --source /home/aorth/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books.map &> /tmp/ciat-books.log
     $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/35701 --source /home/aorth/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books2.map &> /tmp/ciat-books2.log
    -
    - - -

    2017-06-25

    dspace=# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=237 and text_value like 'Regenerating Degraded Landscapes%';
 text_value
     ------------
     (0 rows)
    -
    - -
  • Marianne from WLE asked if they can have both Phase I and II research themes together in the item submission form
  • Perhaps we can add them together in the same question for cg.identifier.wletheme

    2017-06-30

    Java stacktrace: java.util.NoSuchElementException: Timeout waiting for idle object

  • After looking at the Tomcat logs, Munin graphs, and PostgreSQL connection stats, it seems there is just a high load
  • Might be a good time to adjust DSpace’s database connection settings, like I first mentioned in April, 2017 after reading the 2017-04 DCAT comments
  • I’ve adjusted the following in CGSpace’s config:
  • We will need to adjust this again (as well as the pg_hba.conf settings) when we deploy Tsega's REST API
  • Whip up a test for Marianne of WLE to be able to show both their Phase I and II research themes in the CGSpace item submission form:

    Test A for displaying the Phase I and II research themes
    Test B for displaying the Phase I and II research themes
    diff --git a/docs/2017-07/index.html b/docs/2017-07/index.html index 260ceb3d4..0a8f229c0 100644 --- a/docs/2017-07/index.html +++ b/docs/2017-07/index.html @@ -8,16 +8,13 @@ @@ -28,18 +25,15 @@ We can use PostgreSQL’s extended output format (-x) plus sed to format the - + @@ -120,182 +114,138 @@ We can use PostgreSQL’s extended output format (-x) plus sed to format the

    -

    2017-07-01

    2017-07-04

    -
    $ psql dspacenew -x -c 'select element, qualifier, scope_note from metadatafieldregistry where metadata_schema_id=5 order by element, qualifier;' | sed -r 's:^-\[ RECORD (.*) \]-+$:</dc-type>\n<dc-type>\n<schema>cg</schema>:;s:([^ ]*) +\| (.*):  <\1>\2</\1>:;s:^$:</dc-type>:;1s:</dc-type>\n::'
    -
    - - - -

    2017-07-24

    $ for community in 10568/2347 10568/25209; do /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/27629 --child="$community"; done
    -
    - -
  • Discuss CGIAR Library data cleanup with Sisay and Abenet

    2017-07-27

    - -

    2017-07-28

    2017-07-29

    2017-07-30

    2017-07-31

    delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value='PII-FP4_CRMWestAfrica';
     update metadatavalue set text_value='FP3_VietnamLED' where resource_type_id=2 and metadata_field_id=134 and text_value='FP3_VeitnamLED';
     update metadatavalue set text_value='PII-FP1_PIRCCA' where resource_type_id=2 and metadata_field_id=235 and text_value='PII-SEA_PIRCCA';
     delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value='PII-WA_IntegratedInterventions';
    -
    - -
  • Now just waiting to run them on CGSpace, and then apply the modified input forms after Macaroni Bros give me an updated list
  • Temporarily increase the nginx upload limit to 200MB for Sisay to upload the CIAT presentations
  • Looking at CGSpace activity page, there are 52 Baidu bots concurrently crawling our website (I copied the activity page to a text file and grep it)!
    $ grep 180.76. /tmp/status | awk '{print $5}' | sort | uniq | wc -l
     52
    -
  • - -
  • From looking at the dspace.log I see they are all using the same session, which means our Crawler Session Manager Valve is working (a sketch of that valve's configuration follows below)
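  • For reference, the valve is configured in Tomcat's server.xml roughly like this (the crawlerUserAgents regex shown is Tomcat's default, included only as an illustration):

    <Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
           crawlerUserAgents=".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*" />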

  • + diff --git a/docs/2017-08/index.html b/docs/2017-08/index.html index a50b90da6..58efe028f 100644 --- a/docs/2017-08/index.html +++ b/docs/2017-08/index.html @@ -8,20 +8,19 @@ - + @@ -140,22 +138,21 @@ Then I cleaned up the author authorities and HTML characters in OpenRefine and s

    -

    2017-08-01

    - +

    2017-08-01

    +
  • The robots.txt only blocks the top-level /discover and /browse URLs… we will need to find a way to forbid them from accessing these!
  • Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
  • It turns out that we're already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
  • Also, the bot has to successfully browse the page first so it can receive the HTTP header…
  • We might actually have to block these requests with HTTP 403 depending on the user agent (a minimal nginx sketch follows at the end of this list)
  • Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415
  • @@ -163,87 +160,67 @@ Then I cleaned up the author authorities and HTML characters in OpenRefine and s
  • I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using g/^$/d
  • Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet
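  • A minimal nginx sketch of the 403-by-user-agent idea above (the location pattern and user agents are placeholders, not a tested CGSpace configuration):

    # block misbehaving crawlers from the dynamic Discovery pages only
    location ~ ^/(discover|search-filter) {
        if ($http_user_agent ~* (Baiduspider|YandexBot)) {
            return 403;
        }
        proxy_pass http://tomcat_http;
    }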
  • - -

    2017-08-02

    2017-08-05

    CIFOR OAI harvesting

    2017-08-07

    2017-08-08

    2017-08-09

    2017-08-10

    dspace#= \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/93761', '10947/1', '10947/10', '10947/11', '10947/12', '10947/13', '10947/14', '10947/15', '10947/16', '10947/17', '10947/18', '10947/19', '10947/2', '10947/20', '10947/21', '10947/22', '10947/23', '10947/24', '10947/25', '10947/2512', '10947/2515', '10947/2516', '10947/2517', '10947/2518', '10947/2519', '10947/2520', '10947/2521', '10947/2522', '10947/2523', '10947/2524', '10947/2525', '10947/2526', '10947/2527', '10947/2528', '10947/2529', '10947/2530', '10947/2531', '10947/2532', '10947/2533', '10947/2534', '10947/2535', '10947/2536', '10947/2537', '10947/2538', '10947/2539', '10947/2540', '10947/2541', '10947/2589', '10947/26', '10947/2631', '10947/27', '10947/2708', '10947/2776', '10947/2782', '10947/2784', '10947/2786', '10947/2790', '10947/28', '10947/2805', '10947/2836', '10947/2871', '10947/2878', '10947/29', '10947/2900', '10947/2919', '10947/3', '10947/30', '10947/31', '10947/32', '10947/33', '10947/34', '10947/3457', '10947/35', '10947/36', '10947/37', '10947/38', '10947/39', '10947/4', '10947/40', '10947/4052', '10947/4054', '10947/4056', '10947/4068', '10947/41', '10947/42', '10947/43', '10947/4368', '10947/44', '10947/4467', '10947/45', '10947/4508', '10947/4509', '10947/4510', '10947/4573', '10947/46', '10947/4635', '10947/4636', '10947/4637', '10947/4638', '10947/4639', '10947/4651', '10947/4657', '10947/47', '10947/48', '10947/49', '10947/5', '10947/50', '10947/51', '10947/5308', '10947/5322', '10947/5324', '10947/5326', '10947/6', '10947/7', '10947/8', '10947/9'))) group by text_value order by count desc) to /tmp/cgiar-library-authors.csv with csv;
    -
    - -
  • Meeting with Peter and CGSpace team
  • Follow up with Atmire on the ticket about ORCID metadata in DSpace
  • Follow up with Lili and Andrea about the pending CCAFS metadata and flagship updates

    2017-08-11

    - -

    2017-08-12

    - +

    2017-08-12

    - -

    2017-08-13

    - +
    # awk '{print $1}' /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 5
    +    140 66.249.66.91
    +    404 66.249.66.90
    +   1479 50.116.102.77
    +   9794 45.5.184.196
    +  85736 70.32.83.92
    +
    +
        # log oai requests
    +    location /oai {
    +        access_log /var/log/nginx/oai.log;
    +        proxy_pass http://tomcat_http;
    +    }
    +

    2017-08-13

    - -

    2017-08-14

    $ ./fix-metadata-values.py -i /tmp/authors-fix-523.csv -f dc.contributor.author -t correct -m 3 -d dspace -u dspace -p fuuuu

  • There were only three deletions so I just did them manually:
    dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='C';
     DELETE 1
     dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='WSSD';

  • Generate a new list of authors from the CGIAR Library community for Peter to look through now that the initial corrections have been done
  • Thinking about resource limits for PostgreSQL again after last week’s CGSpace crash and related to a recent discussion I had in the comments of the April, 2017 DCAT meeting notes
  • In that thread Chris Wilper suggests a new default of 35 max connections for db.maxconnections (from the current default of 30), knowing that each DSpace web application gets to use up to this many on its own

  • - -
  • It would be good to approximate what the theoretical maximum number of connections on a busy server would be, perhaps by looking to see which apps use SQL:

    - +
    $ grep -rsI SQLException dspace-jspui | wc -l          
     473
     $ grep -rsI SQLException dspace-oai | wc -l  
    @@ -320,39 +281,26 @@ $ grep -rsI SQLException dspace-solr | wc -l
     0
     $ grep -rsI SQLException dspace-xmlui | wc -l
     866
    -
  • - -
  • Of those five applications we’re running, only solr appears not to use the database directly
  • And JSPUI is only used internally (so it doesn’t really count), leaving us with OAI, REST, and XMLUI
  • Assuming each takes a theoretical maximum of 35 connections during a heavy load (35 * 3 = 105), that would put the connections well above PostgreSQL’s default max of 100 connections (remember a handful of connections are reserved for the PostgreSQL super user, see superuser_reserved_connections)
  • So we should adjust PostgreSQL’s max connections to be DSpace’s db.maxconnections * 3 + 3 (see the sketch after this list)
  • This would allow each application to use up to db.maxconnections and not to go over the system’s PostgreSQL limit
  • Perhaps since CGSpace is a busy site with lots of resources we could actually use something like 40 for db.maxconnections
  • Also worth looking into is to set up a database pool using JNDI, as apparently DSpace’s db.poolname hasn’t been used since around DSpace 1.7 (according to Chris Wilper’s comments in the thread)
  • Need to go check the PostgreSQL connection stats in Munin on CGSpace from the past week to get an idea if 40 is appropriate
  • Looks like connections hover around 50:

    PostgreSQL connections 2017-08
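  • To make the arithmetic concrete, with three database-using web applications and the value of 40 discussed above, the two settings would be roughly as follows (illustrative values, using the db.maxconnections * 3 + 3 formula):

    # dspace.cfg — per-webapp connection pool
    db.maxconnections = 40

    # postgresql.conf — system-wide limit: 40 * 3 + 3 = 123
    max_connections = 123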

    - -

    2017-08-15

    - -

    2017-08-16

    dspace=# select distinct text_value, text_lang from metadatavalue where resource_type_id=2 and metadata_field_id=254;

  • And actually, we can do it for other generic fields for items in those collections, for example dc.description.abstract:

    - +
    dspace=# update metadatavalue set text_lang='en_US' where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'description' and qualifier = 'abstract') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/93761', '10947/1', '10947/10', '10947/11', '10947/12', '10947/13', '10947/14', '10947/15', '10947/16', '10947/17', '10947/18', '10947/19', '10947/2', '10947/20', '10947/21', '10947/22', '10947/23', '10947/24', '10947/25', '10947/2512', '10947/2515', '10947/2516', '10947/2517', '10947/2518', '10947/2519', '10947/2520', '10947/2521', '10947/2522', '10947/2523', '10947/2524', '10947/2525', '10947/2526', '10947/2527', '10947/2528', '10947/2529', '10947/2530', '10947/2531', '10947/2532', '10947/2533', '10947/2534', '10947/2535', '10947/2536', '10947/2537', '10947/2538', '10947/2539', '10947/2540', '10947/2541', '10947/2589', '10947/26', '10947/2631', '10947/27', '10947/2708', '10947/2776', '10947/2782', '10947/2784', '10947/2786', '10947/2790', '10947/28', '10947/2805', '10947/2836', '10947/2871', '10947/2878', '10947/29', '10947/2900', '10947/2919', '10947/3', '10947/30', '10947/31', '10947/32', '10947/33', '10947/34', '10947/3457', '10947/35', '10947/36', '10947/37', '10947/38', '10947/39', '10947/4', '10947/40', '10947/4052', '10947/4054', '10947/4056', '10947/4068', '10947/41', '10947/42', '10947/43', '10947/4368', '10947/44', '10947/4467', '10947/45', '10947/4508', '10947/4509', '10947/4510', '10947/4573', '10947/46', '10947/4635', '10947/4636', '10947/4637', '10947/4638', '10947/4639', '10947/4651', '10947/4657', '10947/47', '10947/48', '10947/49', '10947/5', '10947/50', '10947/51', '10947/5308', '10947/5322', '10947/5324', '10947/5326', '10947/6', '10947/7', '10947/8', '10947/9')))
    -
  • - -
  • And on others like dc.language.iso, dc.relation.ispartofseries, dc.type, dc.title, etc…
  • Also, to move fields from dc.identifier.url to cg.identifier.url[en_US] (because we don’t use the Dublin Core one for some reason):

    - +
    dspace=# update metadatavalue set metadata_field_id = 219, text_lang = 'en_US' where resource_type_id = 2 AND metadata_field_id = 237;
     UPDATE 15

  • Set the text_lang of all dc.identifier.uri (Handle) fields to be NULL, just like default DSpace does:

    - +
    dspace=# update metadatavalue set text_lang=NULL where resource_type_id = 2 and metadata_field_id = 25 and text_value like 'http://hdl.handle.net/10947/%';
     UPDATE 4248

  • Also update the text_lang of dc.contributor.author fields for metadata in these collections:

    - +
    dspace=# update metadatavalue set text_lang=NULL where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/93761', '10947/1', '10947/10', '10947/11', '10947/12', '10947/13', '10947/14', '10947/15', '10947/16', '10947/17', '10947/18', '10947/19', '10947/2', '10947/20', '10947/21', '10947/22', '10947/23', '10947/24', '10947/25', '10947/2512', '10947/2515', '10947/2516', '10947/2517', '10947/2518', '10947/2519', '10947/2520', '10947/2521', '10947/2522', '10947/2523', '10947/2524', '10947/2525', '10947/2526', '10947/2527', '10947/2528', '10947/2529', '10947/2530', '10947/2531', '10947/2532', '10947/2533', '10947/2534', '10947/2535', '10947/2536', '10947/2537', '10947/2538', '10947/2539', '10947/2540', '10947/2541', '10947/2589', '10947/26', '10947/2631', '10947/27', '10947/2708', '10947/2776', '10947/2782', '10947/2784', '10947/2786', '10947/2790', '10947/28', '10947/2805', '10947/2836', '10947/2871', '10947/2878', '10947/29', '10947/2900', '10947/2919', '10947/3', '10947/30', '10947/31', '10947/32', '10947/33', '10947/34', '10947/3457', '10947/35', '10947/36', '10947/37', '10947/38', '10947/39', '10947/4', '10947/40', '10947/4052', '10947/4054', '10947/4056', '10947/4068', '10947/41', '10947/42', '10947/43', '10947/4368', '10947/44', '10947/4467', '10947/45', '10947/4508', '10947/4509', '10947/4510', '10947/4573', '10947/46', '10947/4635', '10947/4636', '10947/4637', '10947/4638', '10947/4639', '10947/4651', '10947/4657', '10947/47', '10947/48', '10947/49', '10947/5', '10947/50', '10947/51', '10947/5308', '10947/5322', '10947/5324', '10947/5326', '10947/6', '10947/7', '10947/8', '10947/9')));
     UPDATE 4899
    -
  • - -
  • Wow, I just wrote this baller regex facet to find duplicate authors:

    isNotNull(value.match(/(CGIAR .+?)\|\|\1/))

  • This would be true if the authors were like CGIAR System Management Office||CGIAR System Management Office, which some of the CGIAR Library’s were
  • Unfortunately when you fix these in OpenRefine and then submit the metadata to DSpace it doesn’t detect any changes, so you have to edit them all manually via DSpace’s “Edit Item”
  • Ooh! And an even more interesting regex would match any duplicated author:

    isNotNull(value.match(/(.+?)\|\|\1/))

  • Which means it can also be used to find items with duplicate dc.subject fields…
  • Finally sent Peter the final dump of the CGIAR System Organization community so he can have a last look at it
  • Post a message to the dspace-tech mailing list to ask about querying the AGROVOC API from the submission form
  • Abenet was asking if there was some way to hide certain internal items from the “ILRI Research Outputs” RSS feed (which is the top-level ILRI community feed), because Shirley was complaining
  • I think we could use harvest.includerestricted.rss = false but the items might need to be 100% restricted, not just the metadata
  • Adjust Ansible postgres role to use max_connections from a template variable and deploy a new limit of 123 on CGSpace

    2017-08-17

    2017-08-17 07:55:31,396 ERROR net.sf.ehcache.store.DiskStore @ cocoon-ehcacheCache: Could not read disk store element for key PK_G-aspect-cocoon://DRI/12/handle/10568/65885?pipelinehash=823411183535858997_T-Navigation-3368194896954203241. Error was invalid stream header: 00000000
     java.io.StreamCorruptedException: invalid stream header: 00000000

  • Weird that these errors seem to have started on August 11th, the same day we had capacity issues with PostgreSQL:
    # grep -c "ERROR net.sf.ehcache.store.DiskStore" dspace.log.2017-08-*
     dspace.log.2017-08-01:0
     dspace.log.2017-08-02:0
    @@ -453,46 +380,37 @@ dspace.log.2017-08-14:2135
     dspace.log.2017-08-15:1506
     dspace.log.2017-08-16:1935
     dspace.log.2017-08-17:584
    -
  • - -
  • There are none in 2017-07 either…
  • A few posts on the dspace-tech mailing list say this is related to the Cocoon cache somehow
  • I will clear the XMLUI cache for now and see if the errors continue (though perhaps shutting down Tomcat and removing the cache is more effective somehow?)
  • We tested the option for limiting restricted items from the RSS feeds on DSpace Test
  • I created four items, and only the two with public metadata showed up in the community’s RSS feed:
  • Peter responded and said that he doesn't want to limit items to be restricted just so we can change the RSS feeds

    2017-08-18

    $ ./sparql-query http://202.45.139.84:10035/catalogs/fao/repositories/agrovoc
     sparql$ PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
     SELECT 
    -?label 
    +    ?label 
     WHERE {  
    -{  ?concept  skos:altLabel ?label . } UNION {  ?concept  skos:prefLabel ?label . }
    -FILTER regex(str(?label), "^fish", "i") .
    +   {  ?concept  skos:altLabel ?label . } UNION {  ?concept  skos:prefLabel ?label . }
    +   FILTER regex(str(?label), "^fish", "i") .
     } LIMIT 10;
     
     ┌───────────────────────┐                                                      
    @@ -509,89 +427,67 @@ FILTER regex(str(?label), "^fish", "i") .
     │ fishing times         │                                                      
     │ fish passes           │                                                      
     └───────────────────────┘
    -
    - -
  • More examples about SPARQL syntax: https://github.com/rsinger/openlcsh/wiki/Sparql-Examples
  • I found this blog post about speeding up the Tomcat startup time: http://skybert.net/java/improve-tomcat-startup-time/
  • The startup time went from ~80s to 40s!

    2017-08-19

    2017-08-20

    - -

    2017-08-23

    - +
    dspace=# select * from metadatavalue where metadata_field_id=11 and date(text_value) > '2017-05-01T00:00:00Z';
    + metadata_value_id | item_id | metadata_field_id |      text_value      | text_lang | place | authority | confidence 
    +-------------------+---------+-------------------+----------------------+-----------+-------+-----------+------------
    +            123117 |    5872 |                11 | 2017-06-28T13:05:18Z |           |     1 |           |         -1
    +            123042 |    5869 |                11 | 2017-05-15T03:29:23Z |           |     1 |           |         -1
    +            123056 |    5870 |                11 | 2017-05-22T11:27:15Z |           |     1 |           |         -1
    +            123072 |    5871 |                11 | 2017-06-06T07:46:01Z |           |     1 |           |         -1
    +            123171 |    5874 |                11 | 2017-08-04T07:51:20Z |           |     1 |           |         -1
    +(5 rows)
    +
    +
    dspace=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id in (select item_id from metadatavalue where metadata_field_id=11 and date(text_value) > '2017-05-01T00:00:00Z');
    +   handle   
    +------------
    + 10947/4658
    + 10947/4659
    + 10947/4660
    + 10947/4661
    + 10947/4664
    +(5 rows)
    +

    2017-08-23

    - -

    2017-08-28

    2017-08-31

    ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error
     org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
    -
    - -
  • Looking at the logs I see we have been having hundreds or thousands of these errors a few times per week in 2017-07 and almost every day in 2017-08
  • It seems that I changed the db.maxconnections setting from 70 to 40 around 2017-08-14, but Macaroni Bros also reduced their hourly hammering of the REST API then
  • Nevertheless, it seems like a connection limit is not enough and that I should increase it (as well as the system’s PostgreSQL max_connections)

  • + diff --git a/docs/2017-09/index.html b/docs/2017-09/index.html index 709541494..2a8e9d259 100644 --- a/docs/2017-09/index.html +++ b/docs/2017-09/index.html @@ -8,14 +8,11 @@ @@ -26,16 +23,13 @@ Ask Sisay to clean up the WLE approvers a bit, as Marianne’s user account - + @@ -116,49 +110,33 @@ Ask Sisay to clean up the WLE approvers a bit, as Marianne’s user account

    -

    2017-09-06

    2017-09-07

    2017-09-10

    dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
     DELETE 58
    -
    - -
  • I also ran it on DSpace Test because we’ll be migrating the CGIAR Library soon and it would be good to catch these before we migrate
  • Run system updates and restart DSpace Test
  • We only have 7.7GB of free space on DSpace Test so I need to copy some data off of it before doing the CGIAR Library migration (requires lots of exporting and creating temp files)
  • I still have the original data from the CGIAR Library so I’ve zipped it up and sent it off to linode18 for now
  • sha256sum of original-cgiar-library-6.6GB.tar.gz is: bcfabb52f51cbdf164b61b7e9b3a0e498479e4c1ed1d547d32d11f44c0d5eb8a
  • Start doing a test run of the CGIAR Library migration locally
  • Notes and todo checklist here for now: https://gist.github.com/alanorth/3579b74e116ab13418d187ed379abd9c
  • Create pull request for Phase I and II changes to CCAFS Project Tags: #336
  • We’ve been discussing with Macaroni Bros and CCAFS for the past month or so and the list of tags was recently finalized
  • There will need to be some metadata updates — though if I recall correctly it is only about seven records — for that as well, I had made some notes about it in 2017-07, but I’ve asked for more clarification from Lili just in case
  • Looking at the DSpace logs to see if we’ve had a change in the “Cannot get a connection” errors since last month when we adjusted the db.maxconnections parameter on CGSpace:
    # grep -c "Cannot get a connection, pool error Timeout waiting for idle object" dspace.log.2017-09-*
     dspace.log.2017-09-01:0
     dspace.log.2017-09-02:0
    @@ -170,108 +148,84 @@ dspace.log.2017-09-07:0
     dspace.log.2017-09-08:10
     dspace.log.2017-09-09:0
     dspace.log.2017-09-10:0
    -
  • - -
  • Also, since last month (2017-08) Macaroni Bros no longer runs their REST API scraper every hour, so I’m sure that helped
  • There are still some errors, though, so maybe I should bump the connection limit up a bit
  • I remember seeing that Munin shows that the average number of connections is 50 (which is probably mostly from the XMLUI) and we’re currently allowing 40 connections per app, so maybe it would be good to bump that value up to 50 or 60 along with the system’s PostgreSQL max_connections (formula should be: webapps * 60 + 3, or 3 * 60 + 3 = 183 in our case)
  • I updated both CGSpace and DSpace Test to use these new settings (60 connections per web app and 183 for system PostgreSQL limit)
  • I’m expecting to see 0 connection errors for the next few months

    2017-09-11

    - -

    2017-09-12

    $ sudo tcpdump -i en0 -w without-cached-xsd.dump dst port 80 and 'tcp[32:4] = 0x47455420'
    -
    - -
  • Great TCP dump guide here: https://danielmiessler.com/study/tcpdump
  • The last part of that command filters for HTTP GET requests, of which there should have been many to fetch all the XSD files for validation
  • I sent a message to the mailing list to see if anyone knows more about this
  • In looking at the tcpdump results I notice that there is an update check to the ehcache server on every iteration of the ingest loop, for example:
    09:39:36.008956 IP 192.168.8.124.50515 > 157.189.192.67.http: Flags [P.], seq 1736833672:1736834103, ack 147469926, win 4120, options [nop,nop,TS val 1175113331 ecr 550028064], length 431: HTTP: GET /kit/reflector?kitID=ehcache.default&pageID=update.properties&id=2130706433&os-name=Mac+OS+X&jvm-name=Java+HotSpot%28TM%29+64-Bit+Server+VM&jvm-version=1.8.0_144&platform=x86_64&tc-version=UNKNOWN&tc-product=Ehcache+Core+1.7.2&source=Ehcache+Core&uptime-secs=0&patch=UNKNOWN HTTP/1.1
    -
  • - -
  • Turns out this is a known issue and Ehcache has refused to make it opt-in: https://jira.terracotta.org/jira/browse/EHC-461
  • But we can disable it by adding an updateCheck="false" attribute to the main <ehcache> tag in dspace-services/src/main/resources/caching/ehcache-config.xml (a minimal sketch of that change follows below)
  • After re-compiling and re-deploying DSpace I no longer see those update checks during item submission
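  • The change is just the one attribute on the root element of ehcache-config.xml, roughly like this (the element's other attributes and the existing cache definitions are omitted here):

    <!-- dspace-services/src/main/resources/caching/ehcache-config.xml -->
    <ehcache updateCheck="false">
        <!-- existing cache definitions stay as they are -->
    </ehcache>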

  • - -
  • I had a Skype call with Bram Luyten from Atmire to discuss various issues related to ORCID in DSpace

    - + - -

    2017-09-13

    - +
  • + +

    2017-09-13

    # awk '{print $1}' /var/log/nginx/oai.log | sort -n | uniq -c | sort -h | tail -n 10
    -  1 213.136.89.78
    -  1 66.249.66.90
    -  1 66.249.66.92
    -  3 68.180.229.31
    -  4 35.187.22.255
    -13745 54.70.175.86
    -15814 34.211.17.113
    -15825 35.161.215.53
    -16704 54.70.51.7
    -
    - -
  • Compared to the previous day’s logs it looks VERY high:

    - + 1 213.136.89.78 + 1 66.249.66.90 + 1 66.249.66.92 + 3 68.180.229.31 + 4 35.187.22.255 + 13745 54.70.175.86 + 15814 34.211.17.113 + 15825 35.161.215.53 + 16704 54.70.51.7 +
    # awk '{print $1}' /var/log/nginx/oai.log.1 | sort -n | uniq -c | sort -h | tail -n 10
    -  1 207.46.13.39
    -  1 66.249.66.93
    -  2 66.249.66.91
    -  4 216.244.66.194
    - 14 66.249.66.90
    -
  • - -
  • The user agents for those top IPs are:
  • And this user agent has never been seen before today (or at least recently!):
    # grep -c "API scraper" /var/log/nginx/oai.log
     62088
     # zgrep -c "API scraper" /var/log/nginx/oai.log.*.gz
    @@ -304,214 +258,179 @@ dspace.log.2017-09-10:0
     /var/log/nginx/oai.log.7.gz:0
     /var/log/nginx/oai.log.8.gz:0
     /var/log/nginx/oai.log.9.gz:0
    -
    - -
  • Some of these heavy users are also using XMLUI, and their user agent isn’t matched by the Tomcat Session Crawler valve, so each request uses a different session
  • Yesterday alone the IP addresses using the API scraper user agent were responsible for 16,000 sessions in XMLUI:
    # grep -a -E "(54.70.51.7|35.161.215.53|34.211.17.113|54.70.175.86)" /home/cgspace.cgiar.org/log/dspace.log.2017-09-12 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     15924
    -
  • - -
  • If this continues I will definitely need to figure out who is responsible for this scraper and add their user agent to the session crawler valve regex
  • A search for “API scraper” user agent on Google returns a robots.txt with a comment that this is the Yewno bot: http://www.escholarship.org/robots.txt
  • Also, in looking at the DSpace logs I noticed a warning from OAI that I should look into:
    WARN  org.dspace.xoai.services.impl.xoai.DSpaceRepositoryConfiguration @ { OAI 2.0 :: DSpace } Not able to retrieve the dspace.oai.url property from oai.cfg. Falling back to request address
    -
  • - -
  • Looking at the spreadsheet with deletions and corrections that CCAFS sent last week

  • - -
  • It appears they want to delete a lot of metadata, which I’m not sure they realize the implications of:

    dspace=# select text_value, count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in ('EA_PAR','FP1_CSAEvidence','FP2_CRMWestAfrica','FP3_Gender','FP4_Baseline','FP4_CCPAG','FP4_CCPG','FP4_CIATLAM IMPACT','FP4_ClimateData','FP4_ClimateModels','FP4_GenderPolicy','FP4_GenderToolbox','FP4_Livestock','FP4_PolicyEngagement','FP_GII','SA_Biodiversity','SA_CSV','SA_GHGMeasurement','SEA_mitigationSAMPLES','SEA_UpscalingInnovation','WA_Partnership','WA_SciencePolicyExchange') group by text_value;
            text_value        | count
    --------------------------+-------
     FP4_ClimateModels        |     6
     FP1_CSAEvidence          |     7
     SEA_UpscalingInnovation  |     7
     FP4_Baseline             |    69
     WA_Partnership           |     1
     WA_SciencePolicyExchange |     6
     SA_GHGMeasurement        |     2
     SA_CSV                   |     7
     EA_PAR                   |    18
     FP4_Livestock            |     7
     FP4_GenderPolicy         |     4
     FP2_CRMWestAfrica        |    12
     FP4_ClimateData          |    24
     FP4_CCPAG                |     2
     SEA_mitigationSAMPLES    |     2
     SA_Biodiversity          |     1
     FP4_PolicyEngagement     |    20
     FP3_Gender               |     9
     FP4_GenderToolbox        |     3
    (19 rows)
  • I sent CCAFS people an email to ask if they really want to remove these 200+ tags
  • She responded yes, so I’ll at least need to do these deletes in PostgreSQL:
    dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in ('EA_PAR','FP1_CSAEvidence','FP2_CRMWestAfrica','FP3_Gender','FP4_Baseline','FP4_CCPAG','FP4_CCPG','FP4_CIATLAM IMPACT','FP4_ClimateData','FP4_ClimateModels','FP4_GenderPolicy','FP4_GenderToolbox','FP4_Livestock','FP4_PolicyEngagement','FP_GII','SA_Biodiversity','SA_CSV','SA_GHGMeasurement','SEA_mitigationSAMPLES','SEA_UpscalingInnovation','WA_Partnership','WA_SciencePolicyExchange','FP_GII');
    DELETE 207
  • When we discussed this in late July there were some other renames they had requested, but I don’t see them in the current spreadsheet so I will have to follow that up
  • I talked to Macaroni Bros and they said to just go ahead with the other corrections as well, since their spreadsheet evolved organically rather than systematically!
  • The final list of corrections and deletes should therefore be:
    delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value='PII-FP4_CRMWestAfrica';
    update metadatavalue set text_value='FP3_VietnamLED' where resource_type_id=2 and metadata_field_id=134 and text_value='FP3_VeitnamLED';
    update metadatavalue set text_value='PII-FP1_PIRCCA' where resource_type_id=2 and metadata_field_id=235 and text_value='PII-SEA_PIRCCA';
    delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value='PII-WA_IntegratedInterventions';
    delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in ('EA_PAR','FP1_CSAEvidence','FP2_CRMWestAfrica','FP3_Gender','FP4_Baseline','FP4_CCPAG','FP4_CCPG','FP4_CIATLAM IMPACT','FP4_ClimateData','FP4_ClimateModels','FP4_GenderPolicy','FP4_GenderToolbox','FP4_Livestock','FP4_PolicyEngagement','FP_GII','SA_Biodiversity','SA_CSV','SA_GHGMeasurement','SEA_mitigationSAMPLES','SEA_UpscalingInnovation','WA_Partnership','WA_SciencePolicyExchange','FP_GII');
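  • Since these statements are destructive, a cautious way to run them is inside a transaction so the row counts can be inspected before committing; a minimal sketch, assuming the statements above are saved to /tmp/ccafs-projects.sql:

    $ psql dspace
    dspace=# BEGIN;
    dspace=# \i /tmp/ccafs-projects.sql
    dspace=# -- if the DELETE/UPDATE counts don't match the earlier SELECT, ROLLBACK; instead
    dspace=# COMMIT;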
  • Create and merge pull request to shut up the Ehcache update check (#337)
  • Although it looks like there was a previous attempt to disable these update checks that was merged in DSpace 4.0 (although it only affects XMLUI): https://jira.duraspace.org/browse/DS-1492
  • I commented there suggesting that we disable it globally
  • I merged the changes to the CCAFS project tags (#336) but still need to finalize the metadata deletions/renames
  • I merged the CGIAR Library theme changes (#338) to the 5_x-prod branch in preparation for next week’s migration
  • I emailed the Handle administrators (hdladmin@cnri.reston.va.us) to ask them what the process is for changing their prefix to be resolved by our resolver
  • They responded and said that they need email confirmation from the contact of record of the other prefix, so I should have the CGIAR System Organization people email them before I send the new sitebndl.zip
  • Testing to see how we end up with all these new authorities after we keep cleaning and merging them in the database
  • Here are all my distinct authority combinations in the database before:
    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
     text_value |              authority               | confidence 
    ------------+--------------------------------------+------------
     Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad |         -1
     Orth, Alan | 1a1943a0-3f87-402f-9afe-e52fb46a513e |        600
     Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad |        600
     Orth, A.   | 1a1943a0-3f87-402f-9afe-e52fb46a513e |        600
     Orth, Alan | 1a1943a0-3f87-402f-9afe-e52fb46a513e |         -1
     Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad |          0
     Orth, Alan | 0d575fa3-8ac4-4763-a90a-1248d4791793 |         -1
     Orth, Alan | 67a9588f-d86a-4155-81a2-af457e9d13f9 |        600
    (8 rows)
  • And then after adding a new item and selecting an existing “Orth, Alan” with an ORCID in the author lookup:
    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
     text_value |              authority               | confidence 
    ------------+--------------------------------------+------------
     Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad |         -1
     Orth, Alan | 1a1943a0-3f87-402f-9afe-e52fb46a513e |        600
     Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad |        600
     Orth, A.   | 1a1943a0-3f87-402f-9afe-e52fb46a513e |        600
     Orth, Alan | 1a1943a0-3f87-402f-9afe-e52fb46a513e |         -1
     Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad |          0
     Orth, Alan | cb3aa5ae-906f-4902-97b1-2667cf148dde |        600
     Orth, Alan | 0d575fa3-8ac4-4763-a90a-1248d4791793 |         -1
     Orth, Alan | 67a9588f-d86a-4155-81a2-af457e9d13f9 |        600
    (9 rows)
  • It created a new authority… let’s try to add another item and select the same existing author and see what happens in the database:
    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
     text_value |              authority               | confidence 
    ------------+--------------------------------------+------------
     Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad |         -1
     Orth, Alan | 1a1943a0-3f87-402f-9afe-e52fb46a513e |        600
     Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad |        600
     Orth, A.   | 1a1943a0-3f87-402f-9afe-e52fb46a513e |        600
     Orth, Alan | 1a1943a0-3f87-402f-9afe-e52fb46a513e |         -1
     Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad |          0
     Orth, Alan | cb3aa5ae-906f-4902-97b1-2667cf148dde |        600
     Orth, Alan | 0d575fa3-8ac4-4763-a90a-1248d4791793 |         -1
     Orth, Alan | 67a9588f-d86a-4155-81a2-af457e9d13f9 |        600
    (9 rows)
  • No new one… so now let me try to add another item and select the italicized result from the ORCID lookup and see what happens in the database:
    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
     text_value |              authority               | confidence 
    ------------+--------------------------------------+------------
     Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad |         -1
     Orth, Alan | 1a1943a0-3f87-402f-9afe-e52fb46a513e |        600
     Orth, Alan | d85a8a5b-9b82-4aaf-8033-d7e0c7d9cb8f |        600
     Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad |        600
     Orth, A.   | 1a1943a0-3f87-402f-9afe-e52fb46a513e |        600
     Orth, Alan | 1a1943a0-3f87-402f-9afe-e52fb46a513e |         -1
     Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad |          0
     Orth, Alan | cb3aa5ae-906f-4902-97b1-2667cf148dde |        600
     Orth, Alan | 0d575fa3-8ac4-4763-a90a-1248d4791793 |         -1
     Orth, Alan | 67a9588f-d86a-4155-81a2-af457e9d13f9 |        600
    (10 rows)
  • Shit, it created another authority! Let’s try it again!
    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
     text_value |              authority               | confidence
    ------------+--------------------------------------+------------
     Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad |         -1
     Orth, Alan | 1a1943a0-3f87-402f-9afe-e52fb46a513e |        600
     Orth, Alan | d85a8a5b-9b82-4aaf-8033-d7e0c7d9cb8f |        600
     Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad |        600
     Orth, Alan | 9aed566a-a248-4878-9577-0caedada43db |        600
     Orth, A.   | 1a1943a0-3f87-402f-9afe-e52fb46a513e |        600
     Orth, Alan | 1a1943a0-3f87-402f-9afe-e52fb46a513e |         -1
     Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad |          0
     Orth, Alan | cb3aa5ae-906f-4902-97b1-2667cf148dde |        600
     Orth, Alan | 0d575fa3-8ac4-4763-a90a-1248d4791793 |         -1
     Orth, Alan | 67a9588f-d86a-4155-81a2-af457e9d13f9 |        600
    (11 rows)
  • It added another authority… surely this is not the desired behavior, or maybe we are not using this as intended?
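  • A quick way to keep an eye on this while testing is to count how many distinct authority keys each name variant has accumulated; a small sketch reusing the same metadatavalue query as above:

    dspace=# select text_value, count(distinct authority) as authorities from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %' group by text_value;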


    2017-09-14

    2017-09-15

    dspace=# \i /tmp/ccafs-projects.sql 
     DELETE 5
     UPDATE 4
     UPDATE 1
     DELETE 1
     DELETE 207

    2017-09-17

    "server_admins" = (
     "300:0.NA/10568"
     "300:0.NA/10947"
    @@ -526,162 +445,121 @@ DELETE 207
     "300:0.NA/10568"
     "300:0.NA/10947"
     )
    -
    - -
  • More work on the CGIAR Library migration test run locally, as I was having problems with importing the last fourteen items from the CGIAR System Management Office community
  • The problem was that we remapped the items to new collections after the initial import, so the items were using the 10947 prefix but the community and collection were using 10568
  • I ended up having to read the AIP Backup and Restore documentation closely a few times and then explicitly preserve handles and ignore parents:

    $ for item in 10568-93759/ITEM@10947-46*; do ~/dspace/bin/dspace packager -r -t AIP -o ignoreHandle=false -o ignoreParent=true -e aorth@mjanja.ch -p 10568/87738 $item; done

  • Also, this was in replace mode (-r) rather than submit mode (-s), because submit mode always generated a new handle even if I told it not to!
  • I decided to start the import process in the evening rather than waiting for the morning, and right as the first community was finished importing I started seeing Timeout waiting for idle object errors
  • I had to cancel the import, clean up a bunch of database entries, increase the PostgreSQL max_connections as a precaution, restart PostgreSQL and Tomcat, and then finally completed the import (see the quick connection check below)
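  • For reference, a minimal sketch of the connection sanity check I do before and after bumping max_connections, assuming psql is run as a user that can reach the DSpace database:

    $ psql -c 'SHOW max_connections;'
    $ psql -c 'SELECT count(*) FROM pg_stat_activity;'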

    2017-09-18

    With original DSpace 1.7 thumbnail

    After DSpace 5.5

    2017-09-19


    2017-09-19 00:00:14,953 INFO  com.atmire.dspace.discovery.AtmireSolrService @ Processing (0 of 65808): 17607
     ...
     2017-09-19 00:04:18,017 INFO  com.atmire.dspace.discovery.AtmireSolrService @ Processing (65807 of 65808): 83753
  • Sisay asked if he could import 50 items for IITA that have already been checked by Bosede and Bizuwork
  • I had a look at the collection and noticed a bunch of issues with item types and donors, so I asked him to fix those and import it to DSpace Test again first
  • Abenet wants to be able to filter by ISI Journal in advanced search on queries like this: https://cgspace.cgiar.org/discover?filtertype_0=dateIssued&filtertype_1=dateIssued&filter_relational_operator_1=equals&filter_relational_operator_0=equals&filter_1=%5B2010+TO+2017%5D&filter_0=2017&filtertype=type&filter_relational_operator=equals&filter=Journal+Article
  • I opened an issue to track this (#340) and will test it on DSpace Test soon
  • Marianne Gadeberg from WLE asked if I would add an account for Adam Hunt on CGSpace and give him permissions to approve all WLE publications
  • I told him to register first, as he’s a CGIAR user and needs an account to be created before I can add him to the groups

    2017-09-20

    $ schedtool -D -e ionice -c2 -n7 nice -n19 dspace filter-media -f -i 10947/1 -p "ImageMagick PDF Thumbnail"

    2017-09-21

    2017-09-22

    2017-09-24

    CGSpace memory week
    DSpace Test memory week
  • Looking at Linode's instance pricing, for DSpace Test it seems we could use the same 8GB instance for $40/month, and then add block storage of ~300GB for $30 (block storage is currently in beta and priced at $0.10/GiB)
  • For CGSpace we could use the cheaper 12GB instance for $80 and then add block storage of 500GB for $50
  • I've sent Peter a message about moving DSpace Test to the New Jersey data center so we can test the block storage beta
  • Create pull request for adding ISI Journal to search filters (#341)
  • Peter asked if we could map all the items of type Journal Article in ILRI Archive to ILRI articles in journals and newsletters
  • It is easy to do via CSV using OpenRefine but I noticed that on CGSpace ~1,000 of the expected 2,500 are already mapped, while on DSpace Test they were not
  • I've asked Peter if he knows what's going on (or who mapped them)
  • Turns out he had already mapped some, but requested that I finish the rest
  • With this GREL in OpenRefine I can find items that are mapped, i.e. they have 10568/3|| or 10568/3$ in their collection field:

    isNotNull(value.match(/.+?10568\/3(\|\|.+|$)/))

  • Peter also made a lot of changes to the data in the Archives collections while I was attempting to import the changes, so we were essentially competing for PostgreSQL and Solr connections
  • I ended up having to kill the import and wait until he was done
  • I exported a clean CSV and applied the changes from that one, which was a hundred or two less than I thought there should be (at least compared to the current state of DSpace Test, which is a few months old)

    2017-09-25

    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
      text_value  |              authority               | confidence
    --------------+--------------------------------------+------------
     Grace, Delia |                                      |        600
     Grace, Delia | bfa61d7c-7583-4175-991c-2e7315000f0c |        600
     Grace, Delia | bfa61d7c-7583-4175-991c-2e7315000f0c |         -1
     Grace, D.    | 6a8ddca3-33c1-45f9-aa00-6fa9fc91e3fc |         -1

  • Strangely, none of her authority entries have ORCIDs anymore…
  • I’ll just fix the text values and forget about it for now:
    dspace=# update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-4175-991c-2e7315000f0c', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
     UPDATE 610
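  • To double check that the cleanup actually took before reindexing, the earlier select can simply be re-run; it should now show a single authority key for all of her name variants:

    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';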
  • After this we have to reindex the Discovery and Authority cores (as tomcat7 user):
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
     $ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
     
     Initialize org.dspace.authority.indexer.DSpaceAuthorityIndexer
     Exception: null
     java.lang.NullPointerException
            at org.dspace.authority.AuthorityValueGenerator.generateRaw(AuthorityValueGenerator.java:82)
            at org.dspace.authority.AuthorityValueGenerator.generate(AuthorityValueGenerator.java:39)
            at org.dspace.authority.indexer.DSpaceAuthorityIndexer.prepareNextValue(DSpaceAuthorityIndexer.java:201)
            at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:132)
            at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
            at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
            at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:159)
            at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
            at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
            at org.dspace.authority.indexer.AuthorityIndexClient.main(AuthorityIndexClient.java:61)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.lang.reflect.Method.invoke(Method.java:498)
            at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
            at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
     
     real    6m6.447s
     user    1m34.010s
     sys     0m12.113s
  • The index-authority script always seems to fail, I think it’s the same old bug
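  • For the record, a sketch of the invocation I use for that step, following the same ionice/nice pattern as the Discovery reindex above and assuming the launcher command is index-authority:

    $ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-authority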

  • Something interesting for my notes about JNDI database pool—since I couldn’t determine if it was working or not when I tried it locally the other day—is this error message that I just saw in the DSpace logs today:
    ERROR org.dspace.storage.rdbms.DatabaseManager @ Error retrieving JNDI context: jdbc/dspaceLocal
     ...
     INFO  org.dspace.storage.rdbms.DatabaseManager @ Unable to locate JNDI dataSource: jdbc/dspaceLocal
     INFO  org.dspace.storage.rdbms.DatabaseManager @ Falling back to creating own Database pool
  • So it’s good to know that something gets printed when it fails because I didn’t see any mention of JNDI before when I was testing!
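  • A quick check to reuse next time, grepping the day's DSpace log for those JNDI messages to see which pool actually got used (a sketch; adjust the log path and date):

    $ grep -E 'jdbc/dspaceLocal|Falling back to creating own Database pool' [dspace]/log/dspace.log.2017-09-25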

    2017-09-26

    2017-09-28


    # systemctl stop nginx
     # /opt/certbot-auto certonly --standalone --email aorth@mjanja.ch -d library.cgiar.org
     # systemctl start nginx
  • I modified the nginx configuration of the ansible playbooks to use this new certificate and now the certificate is enabled and OCSP stapling is working:
    $ openssl s_client -connect cgspace.cgiar.org:443 -servername library.cgiar.org  -tls1_2 -tlsextdebug -status
     ...
     OCSP Response Data:
     ...
     Cert Status: good

    2017-10-01

    http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336

    2017-10-02

    2017-10-01 20:24:57,928 WARN  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=CA0AA5FEAEA8805645489404CDCE9594:ip_addr=41.204.190.40:ldap_attribute_lookup:type=failed_search javax.naming.CommunicationException\colon; svcgroot2.cgiarad.org\colon;3269 [Root exception is java.net.ConnectException\colon; Connection timed out (Connection timed out)]
     2017-10-01 20:22:37,982 INFO  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=CA0AA5FEAEA8805645489404CDCE9594:ip_addr=41.204.190.40:failed_login:no DN found for user pballantyne
  • I thought maybe his account had expired (seeing as it was the first of the month) but he says he was finally able to log in today
  • The logs for yesterday show fourteen errors related to LDAP auth failures:
    $ grep -c "ldap_authentication:type=failed_auth" dspace.log.2017-10-01
     14
  • For what it’s worth, there are no errors on any other recent days, so it must have been some network issue on Linode or CGNET’s LDAP server
  • Linode emailed to say that linode578611 (DSpace Test) needs to migrate to a new host for a security update so I initiated the migration immediately rather than waiting for the scheduled time in two weeks

    2017-10-04

    http://library.cgiar.org/browse?value=Intellectual%20Assets%20Reports&type=subject → https://cgspace.cgiar.org/browse?value=Intellectual%20Assets%20Reports&type=subject
  • We’ll need to check for browse links and handle them properly, including swapping the subject parameter for systemsubject (which doesn’t exist in Discovery yet, but we’ll need to add it) as we have moved their poorly curated subjects from dc.subject to cg.subject.system
  • The second link was a direct link to a bitstream which has broken due to the sequence being updated, so I told him he should link to the handle of the item instead
  • Help Sisay proof sixty-two IITA records on DSpace Test
  • Lots of inconsistencies and errors in subjects, dc.format.extent, regions, countries
  • Merge the Discovery search changes for ISI Journal (#341)

    2017-10-05

    # awk '{print $1}' /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 10
        141 157.55.39.240
        145 40.77.167.85
        162 66.249.66.92
        181 66.249.66.95
        211 66.249.66.91
        312 66.249.66.94
        384 66.249.66.90
       1495 50.116.102.77
       3904 70.32.83.92
       9904 45.5.184.196
    # awk '{print $1}' /var/log/nginx/oai.log.1 | sort -n | uniq -c | sort -h | tail -n 10
          5 66.249.66.71
          6 66.249.66.67
          6 68.180.229.31
          8 41.84.227.85
          8 66.249.66.92
         17 66.249.66.65
         24 66.249.66.91
         38 66.249.66.95
         69 66.249.66.90
        148 66.249.66.94

    2017-10-06

    Original flat thumbnails
    Tweaked with border and box shadow

    2017-10-10

    Google Search Console
    Google Search Console 2
    Google Search results

    10568/1637 10568/174 10568/27629
     10568/1642 10568/174 10568/27629
     10568/1614 10568/174 10568/27629
     10568/75561 10568/150 10568/27629
     10568/183 10568/230 10568/27629
  • Delete community 10568/174 (Sustainable livestock futures)
  • Delete collections in 10568/27629 that have zero items (33 of them!)

    2017-10-11

    Change of Address error

    2017-10-12

    2017-10-14

    2017-10-22

    2017-10-26

    # grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-25 | sort -n | uniq | wc -l
     18022
  • Compared to other days there were two or three times the number of requests yesterday!
    # grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-23 | sort -n | uniq | wc -l
     3141
     # grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-26 | sort -n | uniq | wc -l
     7851
  • I still have no idea what was causing the load to go up today
  • I finally investigated Magdalena’s issue with the item download stats and now I can’t reproduce it: I get the same number of downloads reported in the stats widget on the item page, the “Most Popular Items” page, and in Usage Stats
  • I think it might have been an issue with the statistics not being fresh
  • I added the admin group for the systems organization to the admin role of the top-level community of CGSpace because I guess Sisay had forgotten
  • Magdalena asked if there was a way to reuse data in item submissions where items have a lot of similar data
  • I told her about the possibility to use per-collection item templates, and asked if her items in question were all from a single collection
  • We’ve never used it but it could be worth looking at

    2017-10-27

    2017-10-28

    2017-10-29

    # grep '2017-10-29 02:' dspace.log.2017-10-29 | grep -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     2049
  • So there were 2049 unique sessions during the hour of 2AM
  • Looking at my notes, the number of unique sessions was about the same during the same hour on other days when there were no alerts
  • I think I’ll need to enable access logging in nginx to figure out what’s going on
  • After enabling logging on requests to XMLUI on / I see some new bot I’ve never seen before:
    137.108.70.6 - - [29/Oct/2017:07:39:49 +0000] "GET /discover?filtertype_0=type&filter_relational_operator_0=equals&filter_0=Internal+Document&filtertype=author&filter_relational_operator=equals&filter=CGIAR+Secretariat HTTP/1.1" 200 7776 "-" "Mozilla/5.0 (compatible; CORE/0.6; +http://core.ac.uk; http://core.ac.uk/intro/contact)"
  • CORE seems to be some bot that is “Aggregating the world’s open access research papers”
  • The contact address listed in their bot’s user agent is incorrect, correct page is simply: https://core.ac.uk/contact
  • I will check the logs in a few days to see if they are harvesting us regularly, then add their bot’s user agent to the Tomcat Crawler Session Valve
  • After browsing the CORE site it seems that the CGIAR Library is somehow a member of CORE, so they have probably only been harvesting CGSpace since we did the migration, as library.cgiar.org directs to us now
  • For now I will just contact them to have them update their contact info in the bot’s user agent, but eventually I think I’ll tell them to swap out the CGIAR Library entry for CGSpace

    2017-10-30

    dspace=# SELECT * FROM pg_stat_activity;
     ...
     (93 rows)
  • Surprise surprise, the CORE bot is likely responsible for the recent load issues, making hundreds of thousands of requests yesterday and today:
    # grep -c "CORE/0.6" /var/log/nginx/access.log 
     26475
     # grep -c "CORE/0.6" /var/log/nginx/access.log.1
     135083
  • IP addresses for this bot currently seem to be:
    # grep "CORE/0.6" /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq
     137.108.70.6
     137.108.70.7
  • I will add their user agent to the Tomcat Session Crawler Valve but it won’t help much because they are only using two sessions:
    # grep 137.108.70 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq
     session_id=5771742CABA3D0780860B8DA81E0551B
     session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
  • … and most of their requests are for dynamic discover pages:
    # grep -c 137.108.70 /var/log/nginx/access.log
     26622
     # grep 137.108.70 /var/log/nginx/access.log | grep -c "GET /discover"
     24055
  • Just because I’m curious who the top IPs are:
    # awk '{print $1}' /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
        496 62.210.247.93
        571 46.4.94.226
        651 40.77.167.39
        763 157.55.39.231
        782 207.46.13.90
        998 66.249.66.90
       1948 104.196.152.243
       4247 190.19.92.5
      31602 137.108.70.6
      31636 137.108.70.7

  • At least we know the top two are CORE, but who are the others?
  • 190.19.92.5 is apparently in Argentina, and 104.196.152.243 is from Google Cloud Engine
  • Actually, these two scrapers might be more responsible for the heavy load than the CORE bot, because they don’t reuse their session variable, creating thousands of new sessions!
    # grep 190.19.92.5 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     1419
     # grep 104.196.152.243 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     2811
  • From looking at the requests, it appears these are from CIAT and CCAFS
  • I wonder if I could somehow instruct them to use a user agent so that we could apply a crawler session manager valve to them
  • Actually, according to the Tomcat docs, we could use an IP with crawlerIps: https://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Crawler_Session_Manager_Valve
  • Ah, wait, it looks like crawlerIps only came in 2017-06, so probably isn’t in Ubuntu 16.04’s 7.0.68 build!
  • That would explain the errors I was getting when trying to set it:

    WARNING: [SetPropertiesRule]{Server/Service/Engine/Host/Valve} Setting property 'crawlerIps' to '190\.19\.92\.5|104\.196\.152\.243' did not find a matching property.

  • As for now, it actually seems the CORE bot coming from 137.108.70.6 and 137.108.70.7 is only using a few sessions per day, which is good:

    # grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=137.108.70.(6|7)' dspace.log.2017-10-30 | sort -n | uniq -c | sort -h
        410 session_id=74F0C3A133DBF1132E7EC30A7E7E0D60:ip_addr=137.108.70.7
        574 session_id=5771742CABA3D0780860B8DA81E0551B:ip_addr=137.108.70.7
       1012 session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A:ip_addr=137.108.70.6

  • I will check again tomorrow

    2017-10-31

    # grep "CORE/0.6" /var/log/nginx/access.log.1 | awk '{print $1}' | sort -n | uniq -c | sort -h
     139109 137.108.70.6
     139253 137.108.70.7

  • I’ve emailed the CORE people to ask if they can update the repository information from CGIAR Library to CGSpace
  • Also, I asked if they could perhaps use the sitemap.xml, OAI-PMH, or REST APIs to index us more efficiently, because they mostly seem to be crawling the nearly endless Discovery facets
  • I added GoAccess to the list of packages to install in the DSpace role of the Ansible infrastructure scripts
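  • On Ubuntu it is just a normal package, and I believe it can also read older, rotated logs from stdin; a rough sketch (the --log-format has to match our nginx log format):

    # apt install goaccess
    # zcat -f -- /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | goaccess --log-format=COMBINED -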

  • It makes it very easy to analyze nginx logs from the command line, to see where traffic is coming from:
    # goaccess /var/log/nginx/access.log --log-format=COMBINED
  • According to Uptime Robot CGSpace went down and up a few times
  • I had a look at goaccess and I saw that CORE was actively indexing
  • Also, PostgreSQL connections were at 91 (with the max being 60 per web app, hmmm)
  • I’m really starting to get annoyed with these guys, and thinking about blocking their IP address for a few days to see if CGSpace becomes more stable
  • Actually, come to think of it, they aren’t even obeying robots.txt, because we actually disallow /discover and /search-filter URLs but they are hitting those massively:
    # grep "CORE/0.6" /var/log/nginx/access.log | grep -o -E "GET /(discover|search-filter)" | sort -n | uniq -c | sort -rn 
     158058 GET /discover
      14260 GET /search-filter

  • I tested a URL of pattern /discover in Google’s webmaster tools and it was indeed identified as blocked
  • I will send feedback to the CORE bot team
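  • For my own notes, a quick way to confirm from the command line which of those paths our robots.txt actually disallows (a small sketch):

    $ curl -s https://cgspace.cgiar.org/robots.txt | grep -E '^Disallow: /(discover|search-filter)'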

    2017-11-01

    2017-11-02

    # grep -c "CORE" /var/log/nginx/access.log
     0
  • Generate list of authors on CGSpace for Peter to go through and correct:
    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
     COPY 54701
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "7/Nov/2017" | grep -c Baiduspider
     8912
     # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "7/Nov/2017" | grep Baiduspider | grep -c -E "GET /(browse|discover|search-filter)"
     2521
  • According to their documentation their bot respects robots.txt, but I don’t see this being the case
  • I think I will end up blocking Baidu as well…
  • Next is for me to look and see what was happening specifically at 3AM and 7AM when the server crashed
  • I should look in nginx access.log, rest.log, oai.log, and DSpace’s dspace.log.2017-11-07
  • Here are the top IPs making requests to XMLUI from 2 to 8 AM:
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '07/Nov/2017:0[2-8]' | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
        279 66.249.66.91
        373 65.49.68.199
        446 68.180.229.254
        470 104.196.152.243
        470 197.210.168.174
        598 207.46.13.103
        603 157.55.39.161
        637 207.46.13.80
        703 207.46.13.36
        724 66.249.66.90

  • Of those, most are Google, Bing, Yahoo, etc, except 63.143.42.244 and 63.143.42.242 which are Uptime Robot
  • Here are the top IPs making requests to REST from 2 to 8 AM:
    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E '07/Nov/2017:0[2-8]' | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
          8 207.241.229.237
         10 66.249.66.90
         16 104.196.152.243
         25 41.60.238.61
         26 157.55.39.161
         27 207.46.13.103
         27 207.46.13.80
         31 207.46.13.36
       1498 50.116.102.77

  • The OAI requests during that same time period are nothing to worry about:
    # cat /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E '07/Nov/2017:0[2-8]' | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
          1 66.249.66.92
          4 66.249.66.90
          6 68.180.229.254

  • The top IPs from dspace.log during the 2–8 AM period:
    $ grep -E '2017-11-07 0[2-8]' dspace.log.2017-11-07 | grep -o -E 'ip_addr=[0-9.]+' | sort -n | uniq -c | sort -h | tail
        143 ip_addr=213.55.99.121
        181 ip_addr=66.249.66.91
        223 ip_addr=157.55.39.161
        248 ip_addr=207.46.13.80
        251 ip_addr=207.46.13.103
        291 ip_addr=207.46.13.36
        297 ip_addr=197.210.168.174
        312 ip_addr=65.49.68.199
        462 ip_addr=104.196.152.243
        488 ip_addr=66.249.66.90

  • These aren’t actually very interesting, as the top few are Google, CIAT, Bingbot, and a few other unknown scrapers
  • The number of requests isn’t even that high to be honest
  • As I was looking at these logs I noticed another heavy user (124.17.34.59) that was not active during this time period, but made many requests today alone:
    # zgrep -c 124.17.34.59 /var/log/nginx/access.log*
     /var/log/nginx/access.log:22581
     /var/log/nginx/access.log.1:0
     /var/log/nginx/access.log.7.gz:0
     /var/log/nginx/access.log.8.gz:0
     /var/log/nginx/access.log.9.gz:1
  • The whois data shows the IP is from China, but the user agent doesn’t really give any clues:
    # grep 124.17.34.59 /var/log/nginx/access.log | awk -F'" ' '{print $3}' | sort | uniq -c | sort -h
        210 "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"
      22610 "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.2; Win64; x64; Trident/7.0; LCTE)"

  • A Google search for “LCTE bot” doesn’t return anything interesting, but this Stack Overflow discussion references the lack of information
  • So basically after a few hours of looking at the log files I am not closer to understanding what is going on!
  • I do know that we want to block Baidu, though, as it does not respect robots.txt
  • And as we speak Linode alerted that the outbound traffic rate is very high for the past two hours (about 12–14 hours)
  • At least for now it seems to be that new Chinese IP (124.17.34.59):
    # grep -E "07/Nov/2017:1[234]:" /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
        198 207.46.13.103
        203 207.46.13.80
        205 207.46.13.36
        218 157.55.39.161
        249 45.5.184.221
        258 45.5.187.130
        386 66.249.66.90
        410 197.210.168.174
       1896 104.196.152.243
      11005 124.17.34.59

  • Seems 124.17.34.59 is really downloading all our PDFs, compared to the next top active IPs during this time!
    # grep -E "07/Nov/2017:1[234]:" /var/log/nginx/access.log | grep 124.17.34.59 | grep -c pdf
     5948
     # grep -E "07/Nov/2017:1[234]:" /var/log/nginx/access.log | grep 104.196.152.243 | grep -c pdf
     0
  • About CIAT, I think I need to encourage them to specify a user agent string for their requests, because they are not reusing their Tomcat session and they are creating thousands of sessions per day
  • All CIAT requests vs unique ones:
    $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-11-07 | wc -l
     3506
     $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-11-07 | sort | uniq | wc -l
     3506
  • I emailed CIAT about the session issue, user agent issue, and told them they should not scrape the HTML contents of communities, instead using the REST API
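  • For example, instead of crawling the community pages they could page through a collection’s items via the REST API; a rough sketch of the kind of request I have in mind (the collection ID and paging values here are made up for illustration):

    # the collection ID (1445) is just an example; limit/offset page through the results
    $ curl -s 'https://cgspace.cgiar.org/rest/collections/1445/items?limit=100&offset=0' -H 'Accept: application/json'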

  • About Baidu, I found a link to their robots.txt tester tool
  • It seems like our robots.txt file is valid, and they claim to recognize that URLs like /discover should be forbidden (不允许, aka “not allowed”):

    Baidu robots.txt tester

    180.76.15.136 - - [07/Nov/2017:06:25:11 +0000] "GET /discover?filtertype_0=crpsubject&filter_relational_operator_0=equals&filter_0=WATER%2C+LAND+AND+ECOSYSTEMS&filtertype=subject&filter_relational_operator=equals&filter=WATER+RESOURCES HTTP/1.1" 200 82265 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
  • Along with another thousand or so requests to URLs that are forbidden in robots.txt today alone:
    # grep -c Baiduspider /var/log/nginx/access.log
     3806
     # grep Baiduspider /var/log/nginx/access.log | grep -c -E "GET /(browse|discover|search-filter)"
     1085
  • I will think about blocking their IPs but they have 164 of them!
    # grep "Baiduspider/2.0" /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq | wc -l
     164

    2017-11-08

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "0[78]/Nov/2017:" | grep 124.17.34.59 | grep -v pdf.jpg | grep -c pdf
     24981
  • This is about 20,000 Tomcat sessions:
    $ cat dspace.log.2017-11-07 dspace.log.2017-11-08 | grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=124.17.34.59' | sort | uniq | wc -l
     20733
  • I’m getting really sick of this
  • Sisay re-uploaded the CIAT records that I had already corrected earlier this week, erasing all my corrections
  • I had to re-correct all the publishers, places, names, dates, etc and apply the changes on DSpace Test
  • Run system updates on DSpace Test and reboot the server
  • Magdalena had written to say that two of their Phase II project tags were missing on CGSpace, so I added them (#346)
  • I figured out a way to use nginx’s map function to assign a “bot” user agent to misbehaving clients who don’t define a user agent
  • Most bots are automatically lumped into one generic session by Tomcat’s Crawler Session Manager Valve but this only works if their user agent matches a pre-defined regular expression like .*[bB]ot.*
  • Some clients send thousands of requests without a user agent which ends up creating thousands of Tomcat sessions, wasting precious memory, CPU, and database resources in the process
  • Basically, we modify the nginx config to add a mapping with a modified user agent $ua:
    map $remote_addr $ua {
        # 2017-11-08 Random Chinese host grabbing 20,000 PDFs
        124.17.34.59     'ChineseBot';
        default          $http_user_agent;
    }
  • If the client’s address matches then the user agent is set, otherwise the default $http_user_agent variable is used
  • Then, in the server’s / block we pass this header to Tomcat:
    proxy_pass http://tomcat_http;
     proxy_set_header User-Agent $ua;
  • Note to self: the $ua variable won’t show up in nginx access logs because the default combined log format doesn’t show it, so don’t run around pulling your hair out wondering why the modified user agents aren’t showing in the logs!
  • If a client matching one of these IPs connects without a session, it will be assigned one by the Crawler Session Manager Valve
  • You can verify by cross referencing nginx’s access.log and DSpace’s dspace.log.2017-11-08, for example:
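  • A minimal sketch of that cross-check, reusing the same grep patterns as above (the client’s nginx hits versus the unique Tomcat sessions it generated):

    # grep -c 124.17.34.59 /var/log/nginx/access.log
    $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=124.17.34.59' dspace.log.2017-11-08 | sort | uniq | wc -l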

  • I will deploy this on CGSpace later this week
  • I am interested to check how this affects the number of sessions used by the CIAT and Chinese bots (see above on 2017-11-07 for example)
  • I merged the clickable thumbnails code to 5_x-prod (#347) and will deploy it later along with the new bot mapping stuff (and re-run the Ansible nginx and tomcat tags)
  • I was thinking about Baidu again and decided to see how many requests they have versus Google to URL paths that are explicitly forbidden in robots.txt:
    # zgrep Baiduspider /var/log/nginx/access.log* | grep -c -E "GET /(browse|discover|search-filter)"
     22229
     # zgrep Googlebot /var/log/nginx/access.log* | grep -c -E "GET /(browse|discover|search-filter)"
     0
  • It seems that they rarely even bother checking robots.txt, but Google does multiple times per day!
    # zgrep Baiduspider /var/log/nginx/access.log* | grep -c robots.txt
     14
     # zgrep Googlebot  /var/log/nginx/access.log* | grep -c robots.txt
     1134
  • I have been looking for a reason to ban Baidu and this is definitely a good one
  • Disallowing Baiduspider in robots.txt probably won’t work because this bot doesn’t seem to respect the robot exclusion standard anyways!
  • I will whip up something in nginx later
  • Run system updates on CGSpace and reboot the server
  • Re-deploy latest 5_x-prod branch on CGSpace and DSpace Test (includes the clickable thumbnails, CCAFS phase II project tags, and updated news text)

    2017-11-09

    # zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep '09/Nov/2017' | grep -c 104.196.152.243
     8956
     $ grep 104.196.152.243 dspace.log.2017-11-09 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     223
  • Versus the same stats for yesterday and the day before:
    # zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep '08/Nov/2017' | grep -c 104.196.152.243 
     10216
     $ grep 104.196.152.243 dspace.log.2017-11-08 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     8120
     $ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     3506
  • The number of sessions is over ten times less!
  • This gets me thinking, I wonder if I can use something like nginx’s rate limiter to automatically change the user agent of clients who make too many requests
  • Perhaps using a combination of geo and map, like illustrated here: https://www.nginx.com/blog/rate-limiting-nginx/

    2017-11-11

    2017-11-12

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "12/Nov/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
        243 5.83.120.111
        335 40.77.167.103
        424 66.249.66.91
        529 207.46.13.36
        554 40.77.167.129
        604 207.46.13.53
        754 104.196.152.243
        883 66.249.66.90
       1150 95.108.181.88
       1381 5.9.6.51

  • 5.9.6.51 seems to be a Russian bot:
    # grep 5.9.6.51 /var/log/nginx/access.log | tail -n 1
     5.9.6.51 - - [12/Nov/2017:08:13:13 +0000] "GET /handle/10568/16515/recent-submissions HTTP/1.1" 200 5097 "-" "Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)"
  • What’s amazing is that it seems to reuse its Java session across all requests:
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2017-11-12
     1558
     $ grep 5.9.6.51 dspace.log.2017-11-12 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     1
  • Bravo to MegaIndex.ru!
  • The same cannot be said for 95.108.181.88, which appears to be YandexBot, even though Tomcat’s Crawler Session Manager valve regex should match ‘YandexBot’:
    # grep 95.108.181.88 /var/log/nginx/access.log | tail -n 1
     95.108.181.88 - - [12/Nov/2017:08:33:17 +0000] "GET /bitstream/handle/10568/57004/GenebankColombia_23Feb2015.pdf HTTP/1.1" 200 972019 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
     $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2017-11-12
     991
  • Move some items and collections on CGSpace for Peter Ballantyne, running move_collections.sh with the following configuration:
    10947/6    10947/1 10568/83389
     10947/34   10947/1 10568/83389
     10947/2512 10947/1 10568/83389
  • I explored nginx rate limits as a way to aggressively throttle Baidu bot, which doesn't seem to respect disallowed URLs in robots.txt
  • There's an interesting blog post from Nginx's team about rate limiting as well as a clever use of mapping with rate limits
  • The solution I came up with uses tricks from both of those
  • I deployed the limit on CGSpace and DSpace Test and it seems to work well:
    $ http --print h https://cgspace.cgiar.org/handle/10568/1 User-Agent:'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
     HTTP/1.1 200 OK
     Connection: keep-alive
     Content-Type: text/html
     Date: Sun, 12 Nov 2017 16:30:21 GMT
     Server: nginx
  • The first request works, the second is denied with an HTTP 503! (see the quick re-test below)
  • I need to remember to check the Munin graphs for PostgreSQL and JVM next week to see how this affects them
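  • A quick way to re-test the limit from the shell (a sketch, assuming the same Baiduspider user agent string; the first request should return 200 and the repeats 503 while the limit is in effect):

    $ for i in 1 2 3; do curl -s -o /dev/null -w '%{http_code}\n' -A 'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)' https://cgspace.cgiar.org/handle/10568/1; done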

    2017-11-13


    # zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "13/Nov/2017" | grep "Baiduspider" | grep -c " 200 "
     1132
     # zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "13/Nov/2017" | grep "Baiduspider" | grep -c " 503 "
     10105
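  • That means roughly 90% of Baiduspider's requests are currently being denied; a quick check of the percentage (assuming bc is installed):

    $ echo "scale=3; 10105 / (10105 + 1132) * 100" | bc
    89.900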
  • Helping Sisay proof 47 records for IITA: https://dspacetest.cgiar.org/handle/10568/97029
  • From looking at the data in OpenRefine I found:

  • After uploading and looking at the data in DSpace Test I saw more errors with CRPs, subjects (one item had four copies of all of its subjects, another had a "." in it), affiliations, sponsors, etc.
  • Atmire responded to the ticket about ORCID stuff a few days ago; today I told them that I need to talk to Peter and the partners to see what we would like to do

    2017-11-14

    $ psql dspace6
     dspace6=# CREATE EXTENSION pgcrypto;
  • Also, local settings are no longer in build.properties, they are now in local.cfg
  • I'm not sure if we can use separate profiles like we did before with mvn -Denv=blah to use blah.properties
  • It seems we need to use "system properties" to override settings, ie: -Ddspace.dir=/Users/aorth/dspace6

    2017-11-15

    2017-11-17

    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep "17/Nov/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         13 66.249.66.223
         14 207.46.13.36
         17 207.46.13.137
         22 207.46.13.23
         23 66.249.66.221
         92 66.249.66.219
        187 104.196.152.243
       1400 70.32.83.92
       1503 50.116.102.77
       6037 45.5.184.196
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "17/Nov/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
        325 139.162.247.24
        354 66.249.66.223
        422 207.46.13.36
        434 207.46.13.23
        501 207.46.13.137
        647 66.249.66.221
        662 34.192.116.178
        762 213.55.99.121
       1867 104.196.152.243
       2020 66.249.66.219

    $ jconsole -J-DsocksProxyHost=localhost -J-DsocksProxyPort=7777 service:jmx:rmi:///jndi/rmi://localhost:9000/jmxrmi -J-DsocksNonProxyHosts=

    Jconsole sessions for XMLUI

    2017-11-19


    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "19/Nov/2017:0[456]" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
        111 66.249.66.155
        171 5.9.6.51
        188 54.162.241.40
        229 207.46.13.23
        233 207.46.13.137
        247 40.77.167.6
        251 207.46.13.36
        275 68.180.229.254
        325 104.196.152.243
       1610 66.249.66.153

  • 66.249.66.153 appears to be Googlebot:
    66.249.66.153 - - [19/Nov/2017:06:26:01 +0000] "GET /handle/10568/2203 HTTP/1.1" 200 6309 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
  • We know Googlebot is persistent but behaves well, so I guess it was just a coincidence that it came at a time when we had other traffic and server activity
  • In related news, I see an Atmire update process going for many hours and responsible for hundreds of thousands of log entries (two thirds of all log entries)
    $ wc -l dspace.log.2017-11-19 
     388472 dspace.log.2017-11-19
     $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19 
     267494
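  • To get a rough idea of when that process starts and stops (a sketch, assuming the standard "YYYY-MM-DD HH:MM:SS,mmm" timestamp at the start of each dspace.log line), print the first and last matching timestamps:

    $ grep com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19 | awk '{print $1, $2}' | sort | sed -n '1p;$p'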
  • WTF is this process doing every day, and for so many hours?
  • In unrelated news, when I was looking at the DSpace logs I saw a bunch of errors like this:
    2017-11-19 03:00:32,806 INFO  org.apache.pdfbox.pdfparser.PDFParser @ Document is encrypted
     2017-11-19 03:00:32,807 ERROR org.apache.pdfbox.filter.FlateFilter @ FlateFilter: stop reading corrupt stream due to a DataFormatException
  • It's been a few days since I enabled the G1GC on DSpace Test and the JVM graph definitely changed:

    Tomcat G1GC

    2017-11-20

    2017-11-21

    2017-11-21 11:11:09,621 WARN  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=2FEC0E5286C17B6694567FFD77C3171C:ip_addr=77.241.141.58:ldap_authentication:type=failed_auth javax.naming.CommunicationException\colon; simple bind failed\colon; svcgroot2.cgiarad.org\colon;3269 [Root exception is javax.net.ssl.SSLHandshakeException\colon; sun.security.validator.ValidatorException\colon; PKIX path validation failed\colon; java.security.cert.CertPathValidatorException\colon; validity check failed]

    2017-11-22

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "22/Nov/2017:0[456]" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
        136 31.6.77.23
        174 68.180.229.254
        217 66.249.66.91
        256 157.55.39.79
        268 54.144.57.183
        281 207.46.13.137
        282 207.46.13.36
        290 207.46.13.23
        696 66.249.66.90
        707 104.196.152.243

    Tomcat JVM with CMS GC

    2017-11-23

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "23/Nov/2017:0[456]" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         88 66.249.66.91
        140 68.180.229.254
        155 54.196.2.131
        182 54.224.164.166
        301 157.55.39.79
        315 207.46.13.36
        331 207.46.13.23
        358 207.46.13.137
        565 104.196.152.243
       1570 66.249.66.90

  • … and the usual REST scrapers from CIAT (45.5.184.196) and CCAFS (70.32.83.92):
    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E "23/Nov/2017:0[456]" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
          5 190.120.6.219
          6 104.198.9.108
         14 104.196.152.243
         21 112.134.150.6
         22 157.55.39.79
         22 207.46.13.137
         23 207.46.13.36
         26 207.46.13.23
        942 45.5.184.196
       3995 70.32.83.92

  • These IPs crawling the REST API don't specify user agents and I'd assume they are creating many Tomcat sessions
  • I would catch them in nginx to assign a "bot" user agent to them so that the Tomcat Crawler Session Manager valve could deal with them, but they don't seem to create any sessions really, at least not in the dspace.log:
    $ grep 70.32.83.92 dspace.log.2017-11-23 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     2
  • I'm wondering if REST works differently, or just doesn't log these sessions?
  • I wonder if they are measurable via JMX MBeans?
  • I did some tests locally and I don't see the sessionCounter incrementing after making requests to REST, but it does with XMLUI and OAI
  • I came across some interesting PostgreSQL tuning advice for SSDs: https://amplitude.engineering/how-a-single-postgresql-config-change-improved-slow-query-performance-by-50x-85593b8991b0
  • Apparently setting random_page_cost to 1 is "common" advice for systems running PostgreSQL on SSD (the default is 4)
  • So I deployed this on DSpace Test and will check the Munin PostgreSQL graphs in a few days to see if anything changes (a quick way to apply and verify it is sketched below)
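  • A minimal way to apply and verify that change (a sketch, assuming PostgreSQL 9.4+ so ALTER SYSTEM is available; it could equally be set in postgresql.conf):

    $ sudo -u postgres psql -c "ALTER SYSTEM SET random_page_cost = 1;"
    $ sudo -u postgres psql -c "SELECT pg_reload_conf();"
    $ sudo -u postgres psql -c "SHOW random_page_cost;"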

    2017-11-24

    PostgreSQL connections after tweak (week)

    PostgreSQL connections after tweak (month)

    192.156.137.184 - - [24/Nov/2017:20:33:58 +0000] "HEAD / HTTP/1.1" 301 0 "-" "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.27.1 zlib/1.2.3 libidn/1.18 libssh2/1.4.2"

    2017-11-26

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "26/Nov/2017:0[567]" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
        190 66.249.66.83
        195 104.196.152.243
        220 40.77.167.82
        246 207.46.13.137
        247 68.180.229.254
        257 157.55.39.214
        289 66.249.66.91
        298 157.55.39.206
        379 66.249.66.70
       1855 66.249.66.90

    2017-11-29

    # cat /var/log/nginx/rest.log  /var/log/nginx/rest.log.1  /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "29/Nov/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
        540 66.249.66.83
        659 40.77.167.36
        663 157.55.39.214
        681 157.55.39.206
        733 157.55.39.158
        850 66.249.66.70
       1311 66.249.66.90
       1340 104.196.152.243
       4008 70.32.83.92
       6053 45.5.184.196

  • PostgreSQL activity shows 69 connections
  • I don't have time to troubleshoot more as I'm in Nairobi working on the HPC so I just restarted Tomcat for now
  • A few hours later Uptime Robot says the server is down again
  • I don't see much activity in the logs but there are 87 PostgreSQL connections
  • But shit, there were 10,000 unique Tomcat sessions today:
    $ cat dspace.log.2017-11-29 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     10037
  • Although maybe that's not much, as the previous two days had more:
    $ cat dspace.log.2017-11-27 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     12377
     $ cat dspace.log.2017-11-28 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     16984
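  • A one-liner to compare those daily session counts in one go (same grep pattern as above, just looped over the log files):

    $ for file in dspace.log.2017-11-2{7,8,9}; do echo "$file: $(grep -o -E 'session_id=[A-Z0-9]{32}' "$file" | sort | uniq | wc -l)"; done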
  • I think we just need to start increasing the number of allowed PostgreSQL connections instead of fighting this, as it's the most common source of crashes we have
  • I will bump DSpace's db.maxconnections from 60 to 90, and PostgreSQL's max_connections from 183 to 273 (which is using my loose formula of 90 * webapps + 3; quick sanity check below)
  • I really need to figure out how to get DSpace to use a PostgreSQL connection pool
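  • Sanity-checking that formula, assuming three webapps are being counted:

    $ echo $((90 * 3 + 3))
    273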

    2017-11-30

    2017-12-01

  • The list of connections to XMLUI and REST API for today:
    # cat /var/log/nginx/rest.log  /var/log/nginx/rest.log.1  /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "1/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         763 2.86.122.76
         907 207.46.13.94
        1805 66.249.66.90
        4007 70.32.83.92
        6061 45.5.184.196
          3 54.75.205.145
          6 70.32.83.92
         14 2a01:7e00::f03c:91ff:fe18:7396
         46 2001:4b99:1:1:216:3eff:fe2c:dc6c
        319 2001:4b99:1:1:216:3eff:fe76:205b

    2017-12-03

    2017-12-04

    DSpace Test PostgreSQL connections month

    CGSpace PostgreSQL connections month

    2017-12-05

    2017-12-06

    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E "6/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         18 95.108.181.88
         19 68.180.229.254
         30 207.46.13.151
         33 207.46.13.110
         38 40.77.167.20
         41 157.55.39.223
         82 104.196.152.243
       1529 50.116.102.77
       4005 70.32.83.92
       6045 45.5.184.196

    2017-12-07

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "7/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail 
        838 40.77.167.11
        939 66.249.66.223
       1149 66.249.66.206
       1316 207.46.13.110
       1322 207.46.13.151
       1323 2001:da8:203:2224:c912:1106:d94f:9189
       1414 157.55.39.223
       2378 104.196.152.243
       2662 66.249.66.219
       5110 124.17.34.60

  • We have never seen 124.17.34.60 before, but it's really hammering us!
  • Apparently it is from China, and here is one of its user agents:
    Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.2; Win64; x64; Trident/7.0; LCTE)
  • It is responsible for 4,500 Tomcat sessions today alone:
    $ grep 124.17.34.60 /home/cgspace.cgiar.org/log/dspace.log.2017-12-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     4574
  • I've adjusted the nginx IP mapping that I set up last month to account for 124.17.34.60 and 124.17.34.59 using a regex, as it's the same bot on the same subnet
  • I was running the DSpace cleanup task manually and it hit an error:
    $ /home/cgspace.cgiar.org/bin/dspace cleanup -v
     ...
     Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
    Detail: Key (bitstream_id)=(144666) is still referenced from table "bundle".

  • The solution, as I discovered in 2017-04, is to set the primary_bitstream_id to null:
    dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (144666);
     UPDATE 1
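  • After nulling the primary bitstream reference, the same cleanup command can be re-run (repeating the UPDATE for any other bitstream IDs it complains about):

    $ /home/cgspace.cgiar.org/bin/dspace cleanup -v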

    2017-12-13

    2017-12-16

    2017-12-17

  • I did a test import of the data locally after building with SAFBuilder but for some reason I had to specify the collection (even though the collections were specified in the collection field)
    $ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" ~/dspace/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/89338 --source /Users/aorth/Downloads/2016\ bulk\ upload\ thumbnails/SimpleArchiveFormat --mapfile=/tmp/ccafs.map &> /tmp/ccafs.log
  • It's the same on DSpace Test, I can't import the SAF bundle without specifying the collection:
    $ dspace import --add --eperson=aorth@mjanja.ch --mapfile=/tmp/ccafs.map --source=/tmp/ccafs-2016/SimpleArchiveFormat
     No collections given. Assuming 'collections' file inside item directory
     Adding items from directory: /tmp/ccafs-2016/SimpleArchiveFormat
    Generating mapfile: /tmp/ccafs.map
     Processing collections file: collections
     Adding item from directory item_1
     java.lang.NullPointerException
            at org.dspace.app.itemimport.ItemImport.addItem(ItemImport.java:865)
            at org.dspace.app.itemimport.ItemImport.addItems(ItemImport.java:736)
            at org.dspace.app.itemimport.ItemImport.main(ItemImport.java:498)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.lang.reflect.Method.invoke(Method.java:498)
            at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
            at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
     java.lang.NullPointerException
     Started: 1513521856014
     Ended: 1513521858573
     Elapsed time: 2 secs (2559 msecs)
  • I even tried to debug it by adding verbose logging to the JAVA_OPTS:
    -Dlog4j.configuration=file:/Users/aorth/dspace/config/log4j-console.properties -Ddspace.log.init.disable=true
  • … but the error message was the same, just with more INFO noise around it
  • For now I'll import into a collection in DSpace Test but I'm really not sure what's up with this!
  • Linode alerted that CGSpace was using high CPU from 4 to 6 PM
  • The logs for today show the CORE bot (137.108.70.7) being active in XMLUI:
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "17/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
        671 66.249.66.70
        885 95.108.181.88
        904 157.55.39.96
        923 157.55.39.179
       1159 207.46.13.107
       1184 104.196.152.243
       1230 66.249.66.91
       1414 68.180.229.254
       4137 66.249.66.90
      46401 137.108.70.7

  • And then some CIAT bot (45.5.184.196) is actively hitting API endpoints:
    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "17/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         33 68.180.229.254
         48 157.55.39.96
         51 157.55.39.179
         56 207.46.13.107
        102 104.196.152.243
        102 66.249.66.90
        691 137.108.70.7
       1531 50.116.102.77
       4014 70.32.83.92
      11030 45.5.184.196

  • That's probably ok, as I don't think the REST API connections use up a Tomcat session…
  • CIP emailed a few days ago to ask about unique IDs for authors and organizations, and if we can provide them via an API
  • Regarding the import issue above, it seems to be a known issue that has a patch in DSpace 5.7:
  • We're on DSpace 5.5 but there is a one-word fix to the addItem() function here: https://github.com/DSpace/DSpace/pull/1731
  • I will apply it on our branch but I need to make a note to NOT cherry-pick it when I rebase on to the latest 5.x upstream later
  • Pull request: #351

    2017-12-18

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "18/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
        190 207.46.13.146
        191 197.210.168.174
        202 86.101.203.216
        268 157.55.39.134
        297 66.249.66.91
        314 213.55.99.121
        402 66.249.66.90
        532 68.180.229.254
        644 104.196.152.243
      32220 137.108.70.7

  • On the API side (REST and OAI) there is still the same CIAT bot (45.5.184.196) from last night making quite a number of requests this morning:
    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "18/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
          7 104.198.9.108
          8 185.29.8.111
          8 40.77.167.176
          9 66.249.66.91
          9 68.180.229.254
         10 157.55.39.134
         15 66.249.66.90
         59 104.196.152.243
       4014 70.32.83.92
       8619 45.5.184.196

  • I need to keep an eye on this issue because it has nice fixes for reducing the number of database connections in DSpace 5.7: https://jira.duraspace.org/browse/DS-3551
  • Update text on CGSpace about page to give some tips to developers about using the resources more wisely (#352)
  • Linode alerted that CGSpace was using 396.3% CPU from 12 to 2 PM
  • The REST and OAI API logs look pretty much the same as earlier this morning, but there's a new IP harvesting XMLUI:
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "18/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail            
        360 95.108.181.88
        477 66.249.66.90
        526 86.101.203.216
        691 207.46.13.13
        698 197.210.168.174
        819 207.46.13.146
        878 68.180.229.254
       1965 104.196.152.243
      17701 2.86.72.181
      52532 137.108.70.7

  • 2.86.72.181 appears to be from Greece, and has the following user agent:
    Mozilla/3.0 (compatible; Indy Library)
  • Surprisingly it seems they are re-using their Tomcat session for all those 17,000 requests:
    $ grep 2.86.72.181 dspace.log.2017-12-18 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l                                                                                          
     1
  • I guess there's nothing I can do to them for now
  • In other news, I am curious how many PostgreSQL connection pool errors we've had in the last month:
    $ grep -c "Cannot get a connection, pool error Timeout waiting for idle object" dspace.log.2017-1* | grep -v :0
     dspace.log.2017-11-07:15695
     dspace.log.2017-11-08:135
    dspace.log.2017-11-29:3972
     dspace.log.2017-12-01:1601
     dspace.log.2017-12-02:1274
     dspace.log.2017-12-07:2769
  • I made a small fix to my move-collections.sh script so that it handles the case when a "to" or "from" community doesn't exist
  • The script lives here: https://gist.github.com/alanorth/e60b530ed4989df0c731afbb0c640515
  • Major reorganization of four of CTA's French collections
  • Basically moving their items into the English ones, then moving the English ones to the top-level of the CTA community, and deleting the old sub-communities
  • Move collection 10568/51821 from 10568/42212 to 10568/42211
  • Move collection 10568/51400 from 10568/42214 to 10568/42211
  • Move collection 10568/56992 from 10568/42216 to 10568/42211
  • Move collection 10568/42218 from 10568/42217 to 10568/42211
  • Export CSV of collection 10568/63484 and move items to collection 10568/51400
  • Export CSV of collection 10568/64403 and move items to collection 10568/56992
  • Export CSV of collection 10568/56994 and move items to collection 10568/42218
  • There are blank lines in this metadata, which causes DSpace to not detect changes in the CSVs
  • I had to use OpenRefine to remove all columns from the CSV except id and collection, and then update the collection field for the new mappings (see the csvkit sketch after this list)
  • Remove empty sub-communities: 10568/42212, 10568/42214, 10568/42216, 10568/42217
  • I was in the middle of applying the metadata imports on CGSpace and the system ran out of PostgreSQL connections…
  • There were 128 PostgreSQL connections at the time… grrrr.
  • So I restarted Tomcat 7 and restarted the imports
  • I assume the PostgreSQL transactions were fine but I will remove the Discovery index for their community and re-run the light-weight indexing to hopefully re-construct everything:
    $ dspace index-discovery -r 10568/42211
     $ schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery
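  • A quicker alternative to OpenRefine for keeping only those two columns (a sketch, assuming csvkit is installed and the exported CSV uses the literal "id" and "collection" headers; the file names here are just placeholders):

    $ csvcut -c id,collection cta-export.csv > cta-mappings.csv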
  • The PostgreSQL issues are getting out of control, I need to figure out how to enable connection pools in Tomcat!

    2017-12-19

    Idle PostgreSQL connections on CGSpace

    2017-12-19 08:17:15,740 ERROR org.dspace.statistics.SolrLogger @ Error CREATEing SolrCore 'statistics-2010': Unable to create core [statistics-2010] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2010/data/index/write.lock
  • I don't have time now to look into this but the Solr sharding has long been an issue!
  • Looking into using JDBC / JNDI to provide a database pool to DSpace
  • The DSpace 6.x configuration docs have more notes about setting up the database pool than the 5.x ones (which actually have none!)
  • First, I uncomment db.jndi in dspace/config/dspace.cfg
  • Then I create a global Resource in the main Tomcat server.xml (inside GlobalNamingResources):
    <Resource name="jdbc/dspace" auth="Container" type="javax.sql.DataSource"
     	  driverClassName="org.postgresql.Driver"
     	  url="jdbc:postgresql://localhost:5432/dspace"
     	  username="dspace"
     	  password="dspace"
    	  initialSize='5'
    	  maxActive='50'
    	  maxIdle='15'
    	  minIdle='5'
    	  maxWait='5000'
    	  validationQuery='SELECT 1'
    	  testOnBorrow='true' />

  • Most of the parameters are from comments by Mark Wood about his JNDI setup: https://jira.duraspace.org/browse/DS-3564
  • Then I add a ResourceLink to each web application context:
    <ResourceLink global="jdbc/dspace" name="jdbc/dspace" type="javax.sql.DataSource"/>
  • I am not sure why several guides show configuration snippets for server.xml and web application contexts that use a Local and Global jdbc…
  • When DSpace can't find the JNDI context (for whatever reason) you will see this in the dspace logs:
    2017-12-19 13:12:08,796 ERROR org.dspace.storage.rdbms.DatabaseManager @ Error retrieving JNDI context: jdbc/dspace
     javax.naming.NameNotFoundException: Name [jdbc/dspace] is not bound in this Context. Unable to find [jdbc].
            at org.apache.naming.NamingContext.lookup(NamingContext.java:825)
            at org.apache.naming.NamingContext.lookup(NamingContext.java:173)
            at org.dspace.storage.rdbms.DatabaseManager.initDataSource(DatabaseManager.java:1414)
            at org.dspace.storage.rdbms.DatabaseManager.initialize(DatabaseManager.java:1331)
            at org.dspace.storage.rdbms.DatabaseManager.getDataSource(DatabaseManager.java:648)
            at org.dspace.storage.rdbms.DatabaseManager.getConnection(DatabaseManager.java:627)
            at org.dspace.core.Context.init(Context.java:121)
            at org.dspace.core.Context.<init>(Context.java:95)
            at org.dspace.app.util.AbstractDSpaceWebapp.register(AbstractDSpaceWebapp.java:79)
            at org.dspace.app.util.DSpaceContextListener.contextInitialized(DSpaceContextListener.java:128)
            at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:5110)
            at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5633)
            at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:145)
            at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:1015)
            at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:991)
            at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:652)
            at org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:712)
            at org.apache.catalina.startup.HostConfig$DeployDescriptor.run(HostConfig.java:2002)
            at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
            at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
            at java.lang.Thread.run(Thread.java:748)
     2017-12-19 13:12:08,798 INFO  org.dspace.storage.rdbms.DatabaseManager @ Unable to locate JNDI dataSource: jdbc/dspace
     2017-12-19 13:12:08,798 INFO  org.dspace.storage.rdbms.DatabaseManager @ Falling back to creating own Database pool
  • And indeed the Catalina logs show that it failed to set up the JDBC driver:
    org.apache.tomcat.dbcp.dbcp.SQLNestedException: Cannot load JDBC driver class 'org.postgresql.Driver'
  • There are several copies of the PostgreSQL driver installed by DSpace:
    $ find ~/dspace/ -iname "postgresql*jdbc*.jar"
     /Users/aorth/dspace/webapps/jspui/WEB-INF/lib/postgresql-9.1-901-1.jdbc4.jar
     /Users/aorth/dspace/webapps/oai/WEB-INF/lib/postgresql-9.1-901-1.jdbc4.jar
     /Users/aorth/dspace/webapps/xmlui/WEB-INF/lib/postgresql-9.1-901-1.jdbc4.jar
     /Users/aorth/dspace/webapps/rest/WEB-INF/lib/postgresql-9.1-901-1.jdbc4.jar
     /Users/aorth/dspace/lib/postgresql-9.1-901-1.jdbc4.jar
  • These apparently come from the main DSpace pom.xml:
    <dependency>
       <groupId>postgresql</groupId>
       <artifactId>postgresql</artifactId>
       <version>9.1-901-1.jdbc4</version>
     </dependency>
  • So WTF? Let's try copying one to Tomcat's lib folder and restarting Tomcat:
    $ cp ~/dspace/lib/postgresql-9.1-901-1.jdbc4.jar /usr/local/opt/tomcat@7/libexec/lib
  • Oh that's fantastic, now at least Tomcat doesn't print an error during startup so I guess it succeeds in creating the JNDI pool
  • DSpace starts up but I have no idea if it's using the JNDI configuration because I see this in the logs:
    2017-12-19 13:26:54,271 INFO  org.dspace.storage.rdbms.DatabaseManager @ DBMS is '{}'PostgreSQL
     2017-12-19 13:26:54,277 INFO  org.dspace.storage.rdbms.DatabaseManager @ DBMS driver version is '{}'9.5.10
     2017-12-19 13:26:54,293 INFO  org.dspace.storage.rdbms.DatabaseUtils @ Loading Flyway DB migrations from: filesystem:/Users/aorth/dspace/etc/postgres, classpath:org.dspace.storage.rdbms.sqlmigration.postgres, classpath:org.dspace.storage.rdbms.migration
     2017-12-19 13:26:54,306 INFO  org.flywaydb.core.internal.dbsupport.DbSupportFactory @ Database: jdbc:postgresql://localhost:5432/dspacetest (PostgreSQL 9.5)
  • Let's try again, but this time explicitly blank the PostgreSQL connection parameters in dspace.cfg and see if DSpace starts…
  • Wow, ok, that works, but having to copy the PostgreSQL JDBC JAR to Tomcat's lib folder totally blows
  • Also, it's likely this is only a problem on my local macOS + Tomcat test environment
  • Ubuntu's Tomcat distribution will probably handle this differently
  • So for reference I have:
  • After adding the Resource to server.xml on Ubuntu I get this in Catalina's logs:
    SEVERE: Unable to create initial connections of pool.
     java.sql.SQLException: org.postgresql.Driver
     ...
     Caused by: java.lang.ClassNotFoundException: org.postgresql.Driver
  • The username and password are correct, but maybe I need to copy the fucking lib there too?
  • I tried installing Ubuntu's libpostgresql-jdbc-java package but Tomcat still can't find the class
  • Let me try to symlink the lib into Tomcat's libs:
    # ln -sv /usr/share/java/postgresql.jar /usr/share/tomcat7/lib
  • Now Tomcat starts but the localhost container has errors:
    SEVERE: Exception sending context initialized event to listener instance of class org.dspace.app.util.DSpaceContextListener
     java.lang.AbstractMethodError: Method org/postgresql/jdbc3/Jdbc3ResultSet.isClosed()Z is abstract
  • Could be a version issue or something since the Ubuntu package provides 9.2 and DSpace's are 9.1…
  • Let me try to remove it and copy in DSpace's:
    # rm /usr/share/tomcat7/lib/postgresql.jar
     # cp [dspace]/webapps/xmlui/WEB-INF/lib/postgresql-9.1-901-1.jdbc4.jar /usr/share/tomcat7/lib/
  • Wow, I think that actually works…
  • I wonder if I could get the JDBC driver from postgresql.org instead of relying on the one from the DSpace build: https://jdbc.postgresql.org/
  • I notice our version is 9.1-901, which isn't even available anymore! The latest in the archived versions is 9.1-903
  • Also, since I commented out all the db parameters in DSpace.cfg, how does the command line dspace tool work?
  • Let's try the upstream JDBC driver first:
    # rm /usr/share/tomcat7/lib/postgresql-9.1-901-1.jdbc4.jar
     # wget https://jdbc.postgresql.org/download/postgresql-42.1.4.jar -O /usr/share/tomcat7/lib/postgresql-42.1.4.jar
  • DSpace command line fails unless db settings are present in dspace.cfg:
    $ dspace database info
     Caught exception:
     java.sql.SQLException: java.lang.ClassNotFoundException: 
            at org.dspace.storage.rdbms.DataSourceInit.getDatasource(DataSourceInit.java:171)
            at org.dspace.storage.rdbms.DatabaseManager.initDataSource(DatabaseManager.java:1438)
            at org.dspace.storage.rdbms.DatabaseUtils.main(DatabaseUtils.java:81)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.lang.reflect.Method.invoke(Method.java:498)
            at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
            at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
     Caused by: java.lang.ClassNotFoundException: 
            at java.lang.Class.forName0(Native Method)
            at java.lang.Class.forName(Class.java:264)
            at org.dspace.storage.rdbms.DataSourceInit.getDatasource(DataSourceInit.java:41)
            ... 8 more

  • And in the logs:
    2017-12-19 18:26:56,971 ERROR org.dspace.storage.rdbms.DatabaseManager @ Error retrieving JNDI context: jdbc/dspace
     javax.naming.NoInitialContextException: Need to specify class name in environment or system property, or as an applet parameter, or in an application resource file:  java.naming.factory.initial
            at javax.naming.spi.NamingManager.getInitialContext(NamingManager.java:662)
            at javax.naming.InitialContext.getDefaultInitCtx(InitialContext.java:313)
            at javax.naming.InitialContext.getURLOrDefaultInitCtx(InitialContext.java:350)
            at javax.naming.InitialContext.lookup(InitialContext.java:417)
            at org.dspace.storage.rdbms.DatabaseManager.initDataSource(DatabaseManager.java:1413)
            at org.dspace.storage.rdbms.DatabaseUtils.main(DatabaseUtils.java:81)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.lang.reflect.Method.invoke(Method.java:498)
            at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
            at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
     2017-12-19 18:26:56,983 INFO  org.dspace.storage.rdbms.DatabaseManager @ Unable to locate JNDI dataSource: jdbc/dspace
     2017-12-19 18:26:56,983 INFO  org.dspace.storage.rdbms.DatabaseManager @ Falling back to creating own Database pool
     2017-12-19 18:26:56,992 WARN  org.dspace.core.ConfigurationManager @ Warning: Number format error in property: db.maxconnections
     2017-12-19 18:26:56,992 WARN  org.dspace.core.ConfigurationManager @ Warning: Number format error in property: db.maxwait
     2017-12-19 18:26:56,993 WARN  org.dspace.core.ConfigurationManager @ Warning: Number format error in property: db.maxidle
  • If I add the db values back to dspace.cfg the dspace database info command succeeds but the log still shows errors retrieving the JNDI connection
  • Perhaps something to report to the dspace-tech mailing list when I finally send my comments
  • Oh cool! select * from pg_stat_activity shows "PostgreSQL JDBC Driver" for the application name! That's how you know it's working! (quick check below)
  • If you monitor the pg_stat_activity while you run dspace database info you can see that it doesn't use the JNDI and creates ~9 extra PostgreSQL connections!
  • And in the middle of all of this Linode sends an alert that CGSpace has high CPU usage from 2 to 4 PM
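  • A quick way to see connections grouped by application name (a sketch, assuming local psql access; the database name may be dspace or dspacetest depending on the environment):

    $ psql -U postgres dspacetest -c "SELECT application_name, count(*) FROM pg_stat_activity GROUP BY application_name;"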

    2017-12-20

    PostgreSQL connection pooling on DSpace Test

    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
     $ dspace import -a -e aorth@mjanja.ch -s /home/aorth/cg_system_20Dec/SimpleArchiveFormat -m systemoffice.map &> systemoffice.log
  • The fucking database went from 47 to 72 to 121 connections while I was importing so it stalled.
  • Since I had to restart Tomcat anyways, I decided to just deploy the new JNDI connection pooling stuff on CGSpace
  • There was an initial connection storm of 50 PostgreSQL connections, but then it settled down to 7
  • After that CGSpace came up fine and I was able to import the 13 items just fine:
    $ dspace import -a -e aorth@mjanja.ch -s /home/aorth/cg_system_20Dec/SimpleArchiveFormat -m systemoffice.map &> systemoffice.log
     $ schedtool -D -e ionice -c2 -n7 nice -n19 dspace filter-media -i 10568/89287
  • The final code for the JNDI work in the Ansible infrastructure scripts is here: https://github.com/ilri/rmg-ansible-public/commit/1959d9cb7a0e7a7318c77f769253e5e029bdfa3b

    2017-12-24


    # find /var/log/nginx -type f -newermt "2017-12-01" | xargs zcat --force | goaccess --log-format=COMBINED -
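  • The same approach can also write a standalone HTML report (a sketch, assuming a goaccess version with the -o option, i.e. 1.0 or newer):

    # find /var/log/nginx -type f -newermt "2017-12-01" | xargs zcat --force | goaccess --log-format=COMBINED -o /tmp/report.html -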
  • I can see interesting things using this approach, for example:

    2017-12-25

    CGSpace PostgreSQL connections week

    2017-12-29

    # update metadatavalue set text_value='Formally Published' where resource_type_id=2 and metadata_field_id=214 and text_value like 'Formally published';
     UPDATE 5
     # delete from metadatavalue where resource_type_id=2 and metadata_field_id=214 and text_value like 'NO';
     UPDATE 5
     # delete from metadatavalue where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(dc.language.iso|CGIAR Challenge Program on Water and Food)';
     DELETE 20
  • I need to figure out why we have records with language "in" because that's not a language!

    2017-12-30


    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "30/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
        637 207.46.13.106
        641 157.55.39.186
        715 68.180.229.254
        924 104.196.152.243
       1012 66.249.64.95
       1060 216.244.66.245
       1120 54.175.208.220
       1287 66.249.64.93
       1586 66.249.64.78
       3653 66.249.64.91

  • Looks pretty normal actually, but I don't know who 54.175.208.220 is
  • They identify as "com.plumanalytics", which Google says is associated with Elsevier
  • They only seem to have used one Tomcat session so that's good, I guess I don't need to add them to the Tomcat Crawler Session Manager valve:
    $ grep 54.175.208.220 dspace.log.2017-12-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l          
     1 
  • 216.244.66.245 seems to be moz.com's DotBot

    2017-12-31
    $ dspace import -a -e aorth@mjanja.ch -s /home/aorth/2016\ bulk\ upload\ thumbnails/SimpleArchiveFormat -m ccafs.map &> ccafs.log


    2018-01-02

    Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
  • Ah hah! So the pool was actually empty!
  • I need to increase that, let's try to bump it up from 50 to 75
  • After that one client got an HTTP 499 but then the rest were HTTP 200, so I don't know what the hell Uptime Robot saw
  • I notice this error quite a few times in dspace.log:
    2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
     org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered " "]" "] "" at line 1, column 32.
  • And there are many of these errors every day for the past month:
    $ grep -c "Error while searching for sidebar facets" dspace.log.*
     dspace.log.2017-11-21:4
     dspace.log.2017-11-22:1
    dspace.log.2017-12-30:89
     dspace.log.2017-12-31:53
     dspace.log.2018-01-01:45
     dspace.log.2018-01-02:34
  • Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let's Encrypt if it's just a handful of domains

    2018-01-03


    $ grep -c "Timeout: Pool empty." dspace.log.2018-01-*
     dspace.log.2018-01-01:0
     dspace.log.2018-01-02:1972
     dspace.log.2018-01-03:1909
  • For some reason there were a lot of "active" connections last night:

    CGSpace PostgreSQL connections

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "3/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
        607 40.77.167.141
        611 2a00:23c3:8c94:7800:392c:a491:e796:9c50
        663 188.226.169.37
        759 157.55.39.245
        887 68.180.229.254
       1037 157.55.39.175
       1068 216.244.66.245
       1495 66.249.64.91
       1934 104.196.152.243
       2219 134.155.96.78

  • 134.155.96.78 appears to be at the University of Mannheim in Germany
  • They identify as: Mozilla/5.0 (compatible; heritrix/3.2.0 +http://ifm.uni-mannheim.de)
  • This appears to be the Internet Archive's open source bot
  • They seem to be re-using their Tomcat session so I don't need to do anything to them just yet:
    $ grep 134.155.96.78 dspace.log.2018-01-03 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     2
  • The API logs show the normal users:
    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "3/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         32 207.46.13.182
         38 40.77.167.132
         38 68.180.229.254
         43 66.249.64.91
         46 40.77.167.141
         49 157.55.39.245
         79 157.55.39.175
       1533 50.116.102.77
       4069 70.32.83.92
       9355 45.5.184.196

  • In other related news I see a sizeable amount of requests coming from python-requests
  • For example, just in the last day there were 1700!
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -c python-requests
     1773
  • But they come from hundreds of IPs, many of which are 54.x.x.x:
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep python-requests | awk '{print $1}' | sort -n | uniq -c | sort -h | tail -n 30
          9 54.144.87.92
          9 54.146.222.143
          9 54.146.249.249
          9 54.158.139.206
          9 54.161.235.224
          9 54.163.41.19
          9 54.163.4.51
          9 54.196.195.107
          9 54.198.89.134
          9 54.80.158.113
         10 54.198.171.98
         10 54.224.53.185
         10 54.226.55.207
         10 54.227.8.195
         10 54.242.234.189
         10 54.242.238.209
         10 54.80.100.66
         11 54.161.243.121
         11 54.205.154.178
         11 54.234.225.84
         11 54.87.23.173
         11 54.90.206.30
         12 54.196.127.62
         12 54.224.242.208
         12 54.226.199.163
         13 54.162.149.249
         13 54.211.182.255
         19 50.17.61.150
         21 54.211.119.107
        139 164.39.7.62
  • I have no idea what these are but they seem to be coming from Amazon…
  • I guess for now I just have to increase the database connection pool's max active
  • It's currently 75 and normally I'd just bump it by 25 but let me be a bit daring and push it by 50 to 125, because I used to see at least 121 connections in pg_stat_activity before when we were using the shitty default pooling


    2018-01-04

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "4/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
        968 197.211.63.81
        981 213.55.99.121
       1039 66.249.64.93
       1258 157.55.39.175
       1273 207.46.13.182
       1311 157.55.39.191
       1319 157.55.39.197
       1775 66.249.64.78
       2216 104.196.152.243
       3366 66.249.64.91

  • Again we ran out of PostgreSQL database connections, even after bumping the pool max active limit from 50 to 75 to 125 yesterday!
    2018-01-04 07:36:08,089 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
     org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-256] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:125; busy:125; idle:0; lastwait:5000].
  • So for this week that is the number one problem!
    $ grep -c "Timeout: Pool empty." dspace.log.2018-01-*
     dspace.log.2018-01-01:0
     dspace.log.2018-01-02:1972
     dspace.log.2018-01-03:1909
     dspace.log.2018-01-04:1559
    -
  • - -
  • I will just bump the connection limit to 300 because I’m fucking fed up with this shit

  • - -
  • Once I get back to Amman I will have to try to create different database pools for different web applications, like recently discussed on the dspace-tech mailing list

  • - -
  • Create accounts on CGSpace for two CTA staff km4ard@cta.int and bheenick@cta.int

  • + - -
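
• A rough sketch for seeing which hours the pool exhaustion clusters in, assuming the standard dspace.log timestamp format shown above (date then time as the first two fields):

    $ grep "Timeout: Pool empty." dspace.log.2018-01-04 | awk '{print substr($2, 1, 2)}' | sort | uniq -c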

    2018-01-05

• After bumping the pool limit to 300 yesterday, there are no more pool-empty timeouts in today’s DSpace log:

    $ grep -c "Timeout: Pool empty." dspace.log.2018-01-*
    dspace.log.2018-01-01:0
    dspace.log.2018-01-02:1972
    dspace.log.2018-01-03:1909
    dspace.log.2018-01-04:1559
    dspace.log.2018-01-05:0

• Daniel asked for help with their DAGRIS server (linode2328112) that has no disk space
• I had a look and there is one Apache 2 log file that is 73GB, with lots of this:

    [Fri Jan 05 09:31:22.965398 2018] [:error] [pid 9340] [client 213.55.99.121:64476] WARNING: Unable to find a match for "9-16-1-RV.doc" in "/home/files/journals/6//articles/9/". Skipping this file., referer: http://dagris.info/reviewtool/index.php/index/install/upgrade

• I will delete the log file for now and tell Danny (see the truncation note below)
• Also, I’m still seeing a hundred or so of the “ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer” errors in the DSpace logs; I need to search the dspace-tech mailing list to see what the cause is
• I will run a full Discovery reindex in the meantime to see if it’s something wrong with the Discovery Solr core:

    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
    $ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b

    real    110m43.985s
    user    15m24.960s
    sys     3m14.890s

• Reboot CGSpace and DSpace Test for new kernels (4.14.12-x86_64-linode92) that partially mitigate the Spectre and Meltdown CPU vulnerabilities
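
• A note on that log file: since Apache keeps it open, a safer sketch than deleting it outright is to truncate it in place (the path is hypothetical; use wherever the 73GB file actually lives), because unlinking an open file doesn’t free the space until Apache restarts:

    # truncate -s 0 /var/log/apache2/error.log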

    2018-01-06

    2018-01-09

• Seeing this Discovery search error:

    org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1983+TO+1989]': Encountered " "]" "] "" at line 1, column 32.
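
• The pluses in that range look like URL-encoding residue (a + standing in for each space), so the query probably reached Solr double-encoded; for comparison, a hedged sketch of the correctly escaped range query against the Discovery core (assuming the default search core on the local Solr):

    $ curl -s 'http://localhost:8081/solr/search/select?q=dateIssued_keyword:%5B1983%20TO%201989%5D&rows=0&wt=json'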

    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
     COPY 4515

    2018-01-10

• Try to shard the Solr statistics core by year with dspace stats-util -s, but it fails while moving records into the yearly core:

    Moving: 81742 into core statistics-2010
    Exception: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2010
    org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2010
            at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:566)
            at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
            at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
            at org.dspace.statistics.SolrLogger.shardSolrIndex(SourceFile:2243)
            at org.dspace.statistics.util.StatisticsClient.main(StatisticsClient.java:106)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.lang.reflect.Method.invoke(Method.java:498)
            at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
            at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
    Caused by: org.apache.http.client.ClientProtocolException
            at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:867)
            at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
            at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
            at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
            at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:448)
            ... 10 more
    Caused by: org.apache.http.client.NonRepeatableRequestException: Cannot retry request with a non-repeatable request entity.  The cause lists the reason the original request failed.
            at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:659)
            at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487)
            at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
            ... 14 more
    Caused by: java.net.SocketException: Connection reset
            at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:115)
            at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
            at org.apache.http.impl.io.AbstractSessionOutputBuffer.flushBuffer(AbstractSessionOutputBuffer.java:159)
            at org.apache.http.impl.io.AbstractSessionOutputBuffer.write(AbstractSessionOutputBuffer.java:179)
            at org.apache.http.impl.io.ChunkedOutputStream.flushCacheWithAppend(ChunkedOutputStream.java:124)
            at org.apache.http.impl.io.ChunkedOutputStream.write(ChunkedOutputStream.java:181)
            at org.apache.http.entity.InputStreamEntity.writeTo(InputStreamEntity.java:132)
            at org.apache.http.entity.HttpEntityWrapper.writeTo(HttpEntityWrapper.java:89)
            at org.apache.http.impl.client.EntityEnclosingRequestWrapper$EntityWrapper.writeTo(EntityEnclosingRequestWrapper.java:108)
            at org.apache.http.impl.entity.EntitySerializer.serialize(EntitySerializer.java:117)
            at org.apache.http.impl.AbstractHttpClientConnection.sendRequestEntity(AbstractHttpClientConnection.java:265)
            at org.apache.http.impl.conn.ManagedClientConnectionImpl.sendRequestEntity(ManagedClientConnectionImpl.java:203)
            at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:236)
            at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:121)
            at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:685)
            ... 16 more

• DSpace Test has the same error but with creating the 2017 core:

    Moving: 2243021 into core statistics-2017
    Exception: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2017
    org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2017
            at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:566)
            at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
            at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
            at org.dspace.statistics.SolrLogger.shardSolrIndex(SourceFile:2243)
            at org.dspace.statistics.util.StatisticsClient.main(StatisticsClient.java:106)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.lang.reflect.Method.invoke(Method.java:498)
            at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
            at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
    Caused by: org.apache.http.client.ClientProtocolException
            at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:867)
            at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
            at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
            at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
            at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:448)
            ... 10 more

• There is interesting documentation about this on the DSpace Wiki: https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics+Maintenance#SOLRStatisticsMaintenance-SolrShardingByYear
• I’m looking to see whether we’re hitting the issues mentioned in DS-2212 that were apparently fixed in DSpace 5.2
• I can apparently search for records in the Solr stats core that have an empty owningColl field using this in the Solr admin query: -owningColl:*
• On CGSpace I see 48,000,000 records that have an owningColl field and 34,000,000 that don’t:

    $ http 'http://localhost:3000/solr/statistics/select?q=owningColl%3A*&wt=json&indent=true' | grep numFound
      "response":{"numFound":48476327,"start":0,"docs":[
    $ http 'http://localhost:3000/solr/statistics/select?q=-owningColl%3A*&wt=json&indent=true' | grep numFound
      "response":{"numFound":34879872,"start":0,"docs":[

• I tested the dspace stats-util -s process on my local machine and it failed the same way
• It doesn’t seem to be helpful, but the dspace log shows this:

    2018-01-10 10:51:19,301 INFO  org.dspace.statistics.SolrLogger @ Created core with name: statistics-2016
     2018-01-10 10:51:19,301 INFO  org.dspace.statistics.SolrLogger @ Moving: 3821 records into core statistics-2016

• Terry Brady has written some notes on the DSpace Wiki about Solr sharding issues: https://wiki.duraspace.org/display/%7Eterrywbrady/Statistics+Import+Export+Issues
• Uptime Robot said that CGSpace went down at around 9:43 AM
• I looked at PostgreSQL’s pg_stat_activity table and saw 161 active connections, but no pool errors in the DSpace logs:

    $ grep -c "Timeout: Pool empty." dspace.log.2018-01-10 
     0

• The XMLUI logs show quite a bit of activity today:

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep "10/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
        951 207.46.13.159
        954 157.55.39.123
       1217 95.108.181.88
       1503 104.196.152.243
       6455 70.36.107.50
      11412 70.36.107.190
      16730 70.36.107.49
      17386 2607:fa98:40:9:26b6:fdff:feff:1c96
      21566 2607:fa98:40:9:26b6:fdff:feff:195d
      45384 2607:fa98:40:9:26b6:fdff:feff:1888

• The user agent for the top six or so IPs is the same:

    "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36"

• whois says they come from Perfect IP
• I’ve never seen those top IPs before, but they have created 50,000 Tomcat sessions today:

    $ grep -E '(2607:fa98:40:9:26b6:fdff:feff:1888|2607:fa98:40:9:26b6:fdff:feff:195d|2607:fa98:40:9:26b6:fdff:feff:1c96|70.36.107.49|70.36.107.190|70.36.107.50)' /home/cgspace.cgiar.org/log/dspace.log.2018-01-10 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
    49096

• Rather than blocking their IPs, I think I might just add their user agent to the “badbots” zone with Baidu, because they seem to be the only ones using that user agent:

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
       6796 70.36.107.50
      11870 70.36.107.190
      17323 70.36.107.49
      19204 2607:fa98:40:9:26b6:fdff:feff:1c96
      23401 2607:fa98:40:9:26b6:fdff:feff:195d
      47875 2607:fa98:40:9:26b6:fdff:feff:1888

• I added the user agent to nginx’s badbots limit req zone, but upon testing the config I got an error:

    # nginx -t
    nginx: [emerg] could not build map_hash, you should increase map_hash_bucket_size: 64
    nginx: configuration file /etc/nginx/nginx.conf test failed

• According to the nginx docs the bucket size should be a multiple of the CPU’s cache alignment:

    # cat /proc/cpuinfo | grep cache_alignment | head -n1
    cache_alignment : 64

• On our servers that is 64, so I increased map_hash_bucket_size to 128 and deployed the changes to nginx (see the deploy sketch below)
• Almost immediately the PostgreSQL connections dropped back down to 40 or so, and UptimeRobot said the site was back up
• So it’s interesting that we’re not out of PostgreSQL connections (the current pool maxActive is 300!) but the system is “down” to UptimeRobot and very slow to use
• Linode continues to test mitigations for Meltdown and Spectre: https://blog.linode.com/2018/01/03/cpu-vulnerabilities-meltdown-spectre/
• I rebooted DSpace Test to see if the kernel will be updated (currently Linux 4.14.12-x86_64-linode92)… nope.
• It looks like Linode will reboot the KVM hosts later this week, though
• Udana from WLE asked if we could give him permission to upload CSVs to CGSpace (which would require super admin access)
• Citing concerns with metadata quality, I suggested adding him on DSpace Test first
• I opened a ticket with Atmire to ask them about DSpace 5.8 compatibility: https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560
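
• For reference, a minimal sketch of the change and deploy, assuming the setting goes in the http block of /etc/nginx/nginx.conf:

    # grep map_hash_bucket_size /etc/nginx/nginx.conf
    map_hash_bucket_size 128;
    # nginx -t && systemctl reload nginx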

    2018-01-11

PostgreSQL load
Firewall load

• Looking at the requests Solr received around the time the statistics sharding ran yesterday morning:

    127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] "GET /solr/statistics/select?q=type%3A2+AND+id%3A1&wt=javabin&version=2 HTTP/1.1" 200 107
     127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] "GET /solr/statistics/select?q=*%3A*&rows=0&facet=true&facet.range=time&facet.range.start=NOW%2FYEAR-18YEARS&facet.range.end=NOW%2FYEAR%2B0YEARS&facet.range.gap=%2B1YEAR&facet.mincount=1&wt=javabin&version=2 HTTP/1.1" 200 447
     127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] "GET /solr/admin/cores?action=STATUS&core=statistics-2016&indexInfo=true&wt=javabin&version=2 HTTP/1.1" 200 76
     127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] "GET /solr/statistics/select?csv.mv.separator=%7C&q=*%3A*&fq=time%3A%28%5B2016%5C-01%5C-01T00%5C%3A00%5C%3A00Z+TO+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%5D+NOT+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%29&rows=10000&wt=csv HTTP/1.1" 200 2137630
     127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] "GET /solr/statistics/admin/luke?show=schema&wt=javabin&version=2 HTTP/1.1" 200 16253
     127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] "POST /solr//statistics-2016/update/csv?commit=true&softCommit=false&waitSearcher=true&f.previousWorkflowStep.split=true&f.previousWorkflowStep.separator=%7C&f.previousWorkflowStep.encapsulator=%22&f.actingGroupId.split=true&f.actingGroupId.separator=%7C&f.actingGroupId.encapsulator=%22&f.containerCommunity.split=true&f.containerCommunity.separator=%7C&f.containerCommunity.encapsulator=%22&f.range.split=true&f.range.separator=%7C&f.range.encapsulator=%22&f.containerItem.split=true&f.containerItem.separator=%7C&f.containerItem.encapsulator=%22&f.p_communities_map.split=true&f.p_communities_map.separator=%7C&f.p_communities_map.encapsulator=%22&f.ngram_query_search.split=true&f.ngram_query_search.separator=%7C&f.ngram_query_search.encapsulator=%22&f.containerBitstream.split=true&f.containerBitstream.separator=%7C&f.containerBitstream.encapsulator=%22&f.owningItem.split=true&f.owningItem.separator=%7C&f.owningItem.encapsulator=%22&f.actingGroupParentId.split=true&f.actingGroupParentId.separator=%7C&f.actingGroupParentId.encapsulator=%22&f.text.split=true&f.text.separator=%7C&f.text.encapsulator=%22&f.simple_query_search.split=true&f.simple_query_search.separator=%7C&f.simple_query_search.encapsulator=%22&f.owningComm.split=true&f.owningComm.separator=%7C&f.owningComm.encapsulator=%22&f.owner.split=true&f.owner.separator=%7C&f.owner.encapsulator=%22&f.filterquery.split=true&f.filterquery.separator=%7C&f.filterquery.encapsulator=%22&f.p_group_map.split=true&f.p_group_map.separator=%7C&f.p_group_map.encapsulator=%22&f.actorMemberGroupId.split=true&f.actorMemberGroupId.separator=%7C&f.actorMemberGroupId.encapsulator=%22&f.bitstreamId.split=true&f.bitstreamId.separator=%7C&f.bitstreamId.encapsulator=%22&f.group_name.split=true&f.group_name.separator=%7C&f.group_name.encapsulator=%22&f.p_communities_name.split=true&f.p_communities_name.separator=%7C&f.p_communities_name.encapsulator=%22&f.query.split=true&f.query.separator=%7C&f.query.encapsulator=%22&f.workflowStep.split=true&f.workflowStep.separator=%7C&f.workflowStep.encapsulator=%22&f.containerCollection.split=true&f.containerCollection.separator=%7C&f.containerCollection.encapsulator=%22&f.complete_query_search.split=true&f.complete_query_search.separator=%7C&f.complete_query_search.encapsulator=%22&f.p_communities_id.split=true&f.p_communities_id.separator=%7C&f.p_communities_id.encapsulator=%22&f.rangeDescription.split=true&f.rangeDescription.separator=%7C&f.rangeDescription.encapsulator=%22&f.group_id.split=true&f.group_id.separator=%7C&f.group_id.encapsulator=%22&f.bundleName.split=true&f.bundleName.separator=%7C&f.bundleName.encapsulator=%22&f.ngram_simplequery_search.split=true&f.ngram_simplequery_search.separator=%7C&f.ngram_simplequery_search.encapsulator=%22&f.group_map.split=true&f.group_map.separator=%7C&f.group_map.encapsulator=%22&f.owningColl.split=true&f.owningColl.separator=%7C&f.owningColl.encapsulator=%22&f.p_group_id.split=true&f.p_group_id.separator=%7C&f.p_group_id.encapsulator=%22&f.p_group_name.split=true&f.p_group_name.separator=%7C&f.p_group_name.encapsulator=%22&wt=javabin&version=2 HTTP/1.1" 409 156

• The new core is created, but when DSpace attempts to POST to it there is an HTTP 409 error
• This is apparently a common Solr error code that means “version conflict”: http://yonik.com/solr/optimistic-concurrency/
• Looks like that bot from the PerfectIP.net host ended up making about 450,000 requests to XMLUI alone yesterday:

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36" | grep "10/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
      21572 70.36.107.50
      30722 70.36.107.190
      34566 70.36.107.49
     101829 2607:fa98:40:9:26b6:fdff:feff:195d
     111535 2607:fa98:40:9:26b6:fdff:feff:1c96
     161797 2607:fa98:40:9:26b6:fdff:feff:1888

• Wow, I just figured out how to set the application name of each database pool in the JNDI config of Tomcat’s server.xml:

    <Resource name="jdbc/dspaceWeb" auth="Container" type="javax.sql.DataSource"
              driverClassName="org.postgresql.Driver"
              url="jdbc:postgresql://localhost:5432/dspacetest?ApplicationName=dspaceWeb"
              username="dspace"
              password="dspace"
              initialSize='5'
              maxActive='75'
              maxIdle='15'
              minIdle='5'
              maxWait='5000'
              validationQuery='SELECT 1'
              testOnBorrow='true' />

• So theoretically I could name each connection “xmlui” or “dspaceWeb” or something meaningful and it would show up in PostgreSQL’s pg_stat_activity table!
• This would be super helpful for figuring out where load was coming from (now I wonder if I could figure out how to graph this)
• Also, I realized that the db.jndi parameter in dspace.cfg needs to match the name value in your application’s context, not the global one
• Ah hah! Also, I can name the default DSpace connection pool in dspace.cfg as well, like:

    db.url = jdbc:postgresql://localhost:5432/dspacetest?ApplicationName=dspaceDefault

• With that it is super easy to see where PostgreSQL connections are coming from in pg_stat_activity
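
• A sketch of what that looks like once the pools report their names, using the dspacetest database from the JNDI URL above:

    $ psql -c "SELECT application_name, count(*) FROM pg_stat_activity WHERE datname = 'dspacetest' GROUP BY application_name ORDER BY 2 DESC;"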

    2018-01-12

    <!-- Define a non-SSL HTTP/1.1 Connector on port 8080 -->
     <Connector port="8080"
               maxThreads="150"
               minSpareThreads="25"
               maxSpareThreads="75"
               enableLookups="false"
               redirectPort="8443"
               acceptCount="100"
               connectionTimeout="20000"
               disableUploadTimeout="true"
               URIEncoding="UTF-8"/>

• In Tomcat 8.5 maxThreads defaults to 200, which is probably fine, but tweaking minSpareThreads could be good (a thread-count sketch is below)
• I don’t see a setting for maxSpareThreads in the docs, so that might be an error
• Looks like in Tomcat 8.5 the default URIEncoding for Connectors is UTF-8, so we don’t need to specify that manually anymore: https://tomcat.apache.org/tomcat-8.5-doc/config/http.html
• Ooh, I just saw the acceptorThreadCount setting (in Tomcat 7 and 8.5):

    The number of threads to be used to accept connections. Increase this value on a multi CPU machine, although you would never really need more than 2. Also, with a lot of non keep alive connections, you might want to increase this value as well. Default value is 1.

• That could be very interesting
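
• A rough way to see how many request-processing threads a connector actually has at a given moment is to count the exec threads in a thread dump; thread names follow the http-bio-<port>-exec pattern, and the port and process match here are assumptions:

    $ jstack $(pgrep -f org.apache.catalina.startup.Bootstrap) | grep -c 'http-bio-8080-exec'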

    2018-01-13

    13-Jan-2018 13:59:05.245 WARNING [main] org.apache.tomcat.dbcp.dbcp2.BasicDataSourceFactory.getObjectInstance Name = dspace6 Property maxActive is not used in DBCP2, use maxTotal instead. maxTotal default value is 8. You have set value of "35" for "maxActive" property, which is being ignored.
     13-Jan-2018 13:59:05.245 WARNING [main] org.apache.tomcat.dbcp.dbcp2.BasicDataSourceFactory.getObjectInstance Name = dspace6 Property maxWait is not used in DBCP2 , use maxWaitMillis instead. maxWaitMillis default value is -1. You have set value of "5000" for "maxWait" property, which is being ignored.
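
• Those two warnings spell out the renames (maxActive becomes maxTotal, maxWait becomes maxWaitMillis); a hedged sed sketch for updating a copy of the resource definition, with illustrative file names:

    $ sed -e 's/maxActive=/maxTotal=/' -e 's/maxWait=/maxWaitMillis=/' server.xml > server.xml.dbcp2
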
• I looked in my Tomcat 7.0.82 logs and I don’t see anything about DBCP2 errors, so I guess this is a Tomcat 8.0.x or 8.5.x thing
• DBCP2 appears to be Tomcat 8.0.x and up according to the Tomcat 8.0 migration guide
• I have updated our Ansible infrastructure scripts so that it will be ready whenever we switch to Tomcat 8 (probably with Ubuntu 18.04 later this year)
• When I enable the ResourceLink in the ROOT.xml context I get the following error in the Tomcat localhost log:

    13-Jan-2018 14:14:36.017 SEVERE [localhost-startStop-1] org.apache.catalina.core.StandardContext.listenerStart Exception sending context initialized event to listener instance of class [org.dspace.app.util.DSpaceWebappListener]
    java.lang.ExceptionInInitializerError
            at org.dspace.app.util.AbstractDSpaceWebapp.register(AbstractDSpaceWebapp.java:74)
            at org.dspace.app.util.DSpaceWebappListener.contextInitialized(DSpaceWebappListener.java:31)
            at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4745)
            at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5207)
            at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
            at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:752)
            at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:728)
            at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:734)
            at org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:629)
            at org.apache.catalina.startup.HostConfig$DeployDescriptor.run(HostConfig.java:1839)
            at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
            at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
            at java.lang.Thread.run(Thread.java:748)
    Caused by: java.lang.NullPointerException
            at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:547)
            at org.dspace.core.Context.<clinit>(Context.java:103)
            ... 15 more

• Interesting blog post benchmarking Tomcat JDBC vs Apache Commons DBCP2, with configuration snippets: http://www.tugay.biz/2016/07/tomcat-connection-pool-vs-apache.html
• The Tomcat vs Apache pool thing is confusing, but apparently we’re using Apache Commons DBCP2 because we don’t specify factory="org.apache.tomcat.jdbc.pool.DataSourceFactory" in our global resource
• So at least I know that I’m not looking for documentation or troubleshooting on the Tomcat JDBC pool!
• I looked at pg_stat_activity during Tomcat’s startup and I see that the pool created in server.xml is indeed connecting, just that nothing uses it
• Also, the fallback connection parameters specified in local.cfg (not dspace.cfg) are used
• Shit, this might actually be a DSpace error: https://jira.duraspace.org/browse/DS-3434
• I’ll comment on that issue

    2018-01-14

    2018-01-15

    update metadatavalue set text_value='Formally Published' where resource_type_id=2 and metadata_field_id=214 and text_value like 'Formally published';
     delete from metadatavalue where resource_type_id=2 and metadata_field_id=214 and text_value like 'NO';
     update metadatavalue set text_value='en' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(En|English)';
     update metadatavalue set text_value='ru' where resource_type_id=2 and metadata_field_id=38 and text_value='Ru';
     update metadatavalue set text_value='in' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(IN|In)';
     delete from metadatavalue where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(dc.language.iso|CGIAR Challenge Program on Water and Food)';

• Continue proofing Peter’s author corrections that I started yesterday, faceting on non-blank, non-flagged values and briefly scrolling through the corrections to find encoding errors in French and Spanish names

OpenRefine Authors

    $ ./fix-metadata-values.py -i /tmp/2018-01-14-Authors-1300-Corrections.csv -f dc.contributor.author -t correct -m 3 -d dspace -u dspace -p 'fuuu'

• While looking at some of the values to delete or check, I found some metadata values whose handles I could not resolve via SQL:

    dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='Tarawali';
     metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place | authority | confidence | resource_type_id
    -------------------+-------------+-------------------+------------+-----------+-------+-----------+------------+------------------
               2757936 |        4369 |                 3 | Tarawali   |           |     9 |           |        600 |                2
    (1 row)

    dspace=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '4369';
     handle
    --------
    (0 rows)

• Even searching in the DSpace advanced search for author equals “Tarawali” produces nothing…
• Otherwise, the DSpace 5 SQL Helper Functions provide ds5_item2itemhandle(), which is much easier than my long query above that I always have to go search for
• For example, to find the Handle for an item that has the author “Erni”:

    dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='Erni';
     metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place |              authority               | confidence | resource_type_id
    -------------------+-------------+-------------------+------------+-----------+-------+--------------------------------------+------------+------------------
               2612150 |       70308 |                 3 | Erni       |           |     9 | 3fe10c68-6773-49a7-89cc-63eb508723f2 |         -1 |                2
    (1 row)
    dspace=# select ds5_item2itemhandle(70308);
     ds5_item2itemhandle
    ---------------------
     10568/68609
    (1 row)

• Next I apply the author deletions:

    $ ./delete-metadata-values.py -i /tmp/2018-01-14-Authors-5-Deletions.csv -f dc.contributor.author -m 3 -d dspace -u dspace -p 'fuuu'

• Now working on the affiliation corrections from Peter:

    $ ./fix-metadata-values.py -i /tmp/2018-01-15-Affiliations-888-Corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p 'fuuu'
    $ ./delete-metadata-values.py -i /tmp/2018-01-15-Affiliations-11-Deletions.csv -f cg.contributor.affiliation -m 211 -d dspace -u dspace -p 'fuuu'

• Now I made a new list of affiliations for Peter to look through:

    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where metadata_schema_id = 2 and element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
    COPY 4552

• Looking over the affiliations again I see dozens of CIAT ones with their affiliation formatted like: International Center for Tropical Agriculture (CIAT)
• For example, this one is from just last month: https://cgspace.cgiar.org/handle/10568/89930
• Our controlled vocabulary has this in the format without the abbreviation: International Center for Tropical Agriculture
• So some submitters don’t know to use the controlled vocabulary lookup
• Help Sisay with some thumbnails for book chapters in OpenRefine and SAFBuilder
• CGSpace users were having problems logging in; I think something’s wrong with LDAP because I see this in the logs:

    2018-01-15 12:53:15,810 WARN  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=2386749547D03E0AA4EC7E44181A7552:ip_addr=x.x.x.x:ldap_authentication:type=failed_auth javax.naming.AuthenticationException\colon; [LDAP\colon; error code 49 - 80090308\colon; LdapErr\colon; DSID-0C090400, comment\colon; AcceptSecurityContext error, data 775, v1db1^@]

• Looks like we processed 2.9 million requests on CGSpace in 2017-12:

    # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Dec/2017"
    2890041

    real    0m25.756s
    user    0m28.016s
    sys     0m2.210s

    2018-01-16

• I removed Tsega’s SSH access to the web and DSpace servers, and asked Danny to check whether there is anything he needs from Tsega’s home directories so we can delete the accounts completely
• I removed Tsega’s access to the Linode dashboard as well
• I ended up creating a Jira issue for my db.jndi documentation fix: DS-3803
• The DSpace developers said they wanted each pull request to be associated with a Jira issue

    2018-01-17

    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
     $ dspace import -a -e aorth@mjanja.ch -s /tmp/2018-01-16\ LIVES/SimpleArchiveFormat -m lives.map &> lives.log

• And fantastic, before I started the import there were 10 PostgreSQL connections, and then CGSpace crashed during the upload
• When I looked there were 210 PostgreSQL connections!
• I don’t see any high load in XMLUI or REST/OAI:

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "17/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
        381 40.77.167.124
        403 213.55.99.121
        431 207.46.13.60
        445 157.55.39.113
        445 157.55.39.231
        449 95.108.181.88
        453 68.180.229.254
        593 54.91.48.104
        757 104.196.152.243
        776 66.249.66.90
    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "17/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         11 205.201.132.14
         11 40.77.167.124
         15 35.226.23.240
         16 157.55.39.231
         16 66.249.64.155
         18 66.249.66.90
         22 95.108.181.88
         58 104.196.152.243
       4106 70.32.83.92
       9229 45.5.184.196

• But I do see this strange message in the dspace log:

    2018-01-17 07:59:25,856 INFO  org.apache.http.impl.client.SystemDefaultHttpClient @ I/O exception (org.apache.http.NoHttpResponseException) caught when processing request to {}->http://localhost:8081: The target server failed to respond
     2018-01-17 07:59:25,856 INFO  org.apache.http.impl.client.SystemDefaultHttpClient @ Retrying request to {}->http://localhost:8081

• I have NEVER seen this error before, and there is no error before or after that in DSpace’s solr.log
• Tomcat’s catalina.out does show something interesting, though, right at that time:

    [====================>                              ]40% time remaining: 7 hour(s) 14 minute(s) 45 seconds. timestamp: 2018-01-17 07:57:02
     [====================>                              ]40% time remaining: 7 hour(s) 14 minute(s) 45 seconds. timestamp: 2018-01-17 07:57:11
     [====================>                              ]40% time remaining: 7 hour(s) 14 minute(s) 44 seconds. timestamp: 2018-01-17 07:57:37
     [====================>                              ]40% time remaining: 7 hour(s) 16 minute(s) 5 seconds. timestamp: 2018-01-17 07:57:49
     Exception in thread "http-bio-127.0.0.1-8081-exec-627" java.lang.OutOfMemoryError: Java heap space
            at org.apache.lucene.util.FixedBitSet.clone(FixedBitSet.java:576)
            at org.apache.solr.search.BitDocSet.andNot(BitDocSet.java:222)
            at org.apache.solr.search.SolrIndexSearcher.getProcessedFilter(SolrIndexSearcher.java:1067)
            at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1557)
            at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1433)
            at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:514)
            at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:485)
            at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:218)
            at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
            at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
            at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
            at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
            at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
            at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
            at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
            at org.dspace.solr.filters.LocalHostRestrictionFilter.doFilter(LocalHostRestrictionFilter.java:50)
            at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
            at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
            at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:221)
            at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
            at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:505)
            at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:169)
            at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
            at org.apache.catalina.valves.CrawlerSessionManagerValve.invoke(CrawlerSessionManagerValve.java:180)
            at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:956)
            at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
            at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:436)
            at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1078)
            at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:625)
            at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:318)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

• You can see the timestamp above, which is some Atmire nightly task I think, but I can’t figure out which one
• So I restarted Tomcat and tried the import again, which finished very quickly and without errors!

    $ dspace import -a -e aorth@mjanja.ch -s /tmp/2018-01-16\ LIVES/SimpleArchiveFormat -m lives2.map &> lives2.log

• Looking at the JVM graphs from Munin it does look like the heap ran out of memory (see the blue dip just before the green spike when I restarted Tomcat):

Tomcat JVM Heap

    $ docker pull docker.bintray.io/jfrog/artifactory-oss:latest
     $ docker volume create --name artifactory5_data
     $ docker network create dspace-build
     $ docker run --network dspace-build --name artifactory -d -v artifactory5_data:/var/opt/jfrog/artifactory -p 8081:8081 docker.bintray.io/jfrog/artifactory-oss:latest
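
• Before pointing maven at it, a quick sanity check that the container is answering on the mapped port (the ping endpoint is part of Artifactory’s REST API):

    $ curl -s http://localhost:8081/artifactory/api/system/ping
    OK
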
• Then configure the local maven to use it in settings.xml with the settings from “Set Me Up”: https://www.jfrog.com/confluence/display/RTF/Using+Artifactory
• This could be a game changer for testing and running the Docker DSpace image
• Wow, I even managed to add the Atmire repository as a remote and map it into the libs-release virtual repository, then tell maven to use it for atmire.com-releases in settings.xml!
• Hmm, some maven dependencies for the SWORDv2 web application in DSpace 5.5 are broken:

    [ERROR] Failed to execute goal on project dspace-swordv2: Could not resolve dependencies for project org.dspace:dspace-swordv2:war:5.5: Failed to collect dependencies at org.swordapp:sword2-server:jar:classes:1.0 -> org.apache.abdera:abdera-client:jar:1.1.1 -> org.apache.abdera:abdera-core:jar:1.1.1 -> org.apache.abdera:abdera-i18n:jar:1.1.1 -> org.apache.geronimo.specs:geronimo-activation_1.0.2_spec:jar:1.1: Failed to read artifact descriptor for org.apache.geronimo.specs:geronimo-activation_1.0.2_spec:jar:1.1: Could not find artifact org.apache.geronimo.specs:specs:pom:1.1 in central (http://localhost:8081/artifactory/libs-release) -> [Help 1]

• I never noticed because I build with that web application disabled:

    $ mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false -Denv=localhost -P \!dspace-sword,\!dspace-swordv2 clean package

• UptimeRobot said CGSpace went down for a few minutes
• I didn’t do anything but it came back up on its own
• I don’t see anything unusual in the XMLUI or REST/OAI logs
• Now Linode alert says the CPU load is high, sigh
• Regarding the heap space error earlier today, it looks like it does happen a few times a week or month (I’m not sure how far these logs go back, as they are not strictly daily):

    # zgrep -c java.lang.OutOfMemoryError /var/log/tomcat7/catalina.out* | grep -v :0
    /var/log/tomcat7/catalina.out:2
    /var/log/tomcat7/catalina.out.10.gz:7
    /var/log/tomcat7/catalina.out.4.gz:3
    /var/log/tomcat7/catalina.out.6.gz:2
    /var/log/tomcat7/catalina.out.7.gz:14

• Overall the heap space usage in the Munin graph seems OK, though I usually increase it by 512MB over the average a few times per year as usage grows
• But maybe I should increase it by more, like 1024MB, to give a bit more head room
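
• A quick way to confirm the current ceiling before bumping it, assuming the heap is set via JAVA_OPTS in /etc/default/tomcat7 as on a stock Ubuntu Tomcat package:

    # grep -o -- '-Xmx[0-9]*[mg]' /etc/default/tomcat7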

    2018-01-18

• Kick off a full Discovery reindex, this time giving the CLI a 1024m heap:

    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
     $ time schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace index-discovery -b

• Maria from Bioversity asked if I could remove the abstracts from all of their Limited Access items in the Bioversity Journal Articles collection
• It’s easy enough to do in OpenRefine, but you have to be careful to only get those items that are uploaded into Bioversity’s collection, not the ones that are mapped from others!
• Use this GREL in OpenRefine after isolating all the Limited Access items: value.startsWith("10568/35501")
• UptimeRobot said CGSpace went down AGAIN and both Sisay and Danny immediately logged in and restarted Tomcat without talking to me or each other!

    Jan 18 07:01:22 linode18 sudo[10805]: dhmichael : TTY=pts/5 ; PWD=/home/dhmichael ; USER=root ; COMMAND=/bin/systemctl restart tomcat7
    Jan 18 07:01:22 linode18 sudo[10805]: pam_unix(sudo:session): session opened for user root by dhmichael(uid=0)
    Jan 18 07:01:22 linode18 systemd[1]: Stopping LSB: Start Tomcat....
    Jan 18 07:01:22 linode18 sudo[10812]: swebshet : TTY=pts/3 ; PWD=/home/swebshet ; USER=root ; COMMAND=/bin/systemctl restart tomcat7
    Jan 18 07:01:22 linode18 sudo[10812]: pam_unix(sudo:session): session opened for user root by swebshet(uid=0)

• I had to cancel the Discovery indexing and I’ll have to re-try it another time when the server isn’t so busy (it had already taken two hours and wasn’t even close to being done)
• For now I’ve increased the Tomcat JVM heap from 5632 to 6144m, to give ~1GB of free memory over the average usage to hopefully account for spikes caused by load or background jobs

    2018-01-19

    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
     $ time schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace index-discovery -b
    -
    - -
  • Linode alerted again and said that CGSpace was using 301% CPU

  • - -
  • Peter emailed to ask why this item doesn’t have an Altmetric badge on CGSpace but does have one on the Altmetric dashboard

  • - -
  • Looks like our badge code calls the handle endpoint which doesn’t exist:

    - -
    https://api.altmetric.com/v1/handle/10568/88090
    -
  • - -
  • I told Peter we should keep an eye out and try again next week

  • + - -

    2018-01-20

    - +
    https://api.altmetric.com/v1/handle/10568/88090
    +
    +

    2018-01-20

    $ time schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace index-authority 
     Retrieving all data 
     Initialize org.dspace.authority.indexer.DSpaceAuthorityIndexer 
     Exception: null
     java.lang.NullPointerException
    -    at org.dspace.authority.AuthorityValueGenerator.generateRaw(AuthorityValueGenerator.java:82)
    -    at org.dspace.authority.AuthorityValueGenerator.generate(AuthorityValueGenerator.java:39)
    -    at org.dspace.authority.indexer.DSpaceAuthorityIndexer.prepareNextValue(DSpaceAuthorityIndexer.java:201)
    -    at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:132)
    -    at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
    -    at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
    -    at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:159)
    -    at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
    -    at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
    -    at org.dspace.authority.indexer.AuthorityIndexClient.main(AuthorityIndexClient.java:61)
    -    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    -    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    -    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    -    at java.lang.reflect.Method.invoke(Method.java:498)
    -    at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
    -    at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
    +        at org.dspace.authority.AuthorityValueGenerator.generateRaw(AuthorityValueGenerator.java:82)
    +        at org.dspace.authority.AuthorityValueGenerator.generate(AuthorityValueGenerator.java:39)
    +        at org.dspace.authority.indexer.DSpaceAuthorityIndexer.prepareNextValue(DSpaceAuthorityIndexer.java:201)
    +        at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:132)
    +        at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
    +        at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
    +        at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:159)
    +        at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
    +        at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
    +        at org.dspace.authority.indexer.AuthorityIndexClient.main(AuthorityIndexClient.java:61)
    +        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    +        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    +        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    +        at java.lang.reflect.Method.invoke(Method.java:498)
    +        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
    +        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
      
     real    7m2.241s
     user    1m33.198s
     sys     0m12.317s
    -
    - -
  • I tested the abstract cleanups on Bioversity’s Journal Articles collection again that I had started a few days ago

  • - -
  • In the end there were 324 items in the collection that were Limited Access, but only 199 had abstracts

  • - -
  • I want to document the workflow of adding a production PostgreSQL database to a development instance of DSpace in Docker:

    - +
    $ docker exec dspace_db dropdb -U postgres dspace
     $ docker exec dspace_db createdb -U postgres -O dspace --encoding=UNICODE dspace
     $ docker exec dspace_db psql -U postgres dspace -c 'alter user dspace createuser;'
    @@ -1274,15 +1072,10 @@ $ docker exec dspace_db psql -U postgres dspace -c 'alter user dspace nocreateus
     $ docker exec dspace_db vacuumdb -U postgres dspace
     $ docker cp ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspace_db:/tmp
     $ docker exec dspace_db psql -U dspace -f /tmp/update-sequences.sql dspace
    -
  • - - -

    2018-01-22

    - +

    2018-01-22

    +
  • I wrote a quick Python script to use the DSpace REST API to find all collections under a given community
  • The source code is here: rest-find-collections.py
  • - -
  • Peter had said that he found a bunch of ILRI collections that were called “untitled”, but I don’t see any:

    $ ./rest-find-collections.py 10568/1 | wc -l
     308
     $ ./rest-find-collections.py 10568/1 | grep -i untitled
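  • For reference, the script basically walks the community and collection endpoints of the DSpace 5 REST API; the raw calls look roughly like this (the community ID is a placeholder, not taken from the script):

    $ curl -s 'https://cgspace.cgiar.org/rest/handle/10568/1' | python -m json.tool
    $ curl -s 'https://cgspace.cgiar.org/rest/communities/COMMUNITY_ID/collections' | python -m json.tool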
    -
    - -
  • Looking at the Tomcat connector docs I think we really need to increase maxThreads

  • - -
  • The default is 200, which can easily be used up by bots, considering that Google and Bing sometimes browse with fifty (50) connections each!

  • - -
  • Before I increase this I want to see if I can measure and graph this, and then benchmark

  • - -
  • I’ll probably also increase minSpareThreads to 20 (its default is 10)

  • - -
  • I still want to bump up acceptorThreadCount from 1 to 2 as well, as the documentation says this should be increased on multi-core systems (there’s a rough sketch of these connector settings at the end of this list)

  • - -
  • I spent quite a bit of time looking at jvisualvm and jconsole today

  • - -
  • Run system updates on DSpace Test and reboot it

  • - -
  • I see I can monitor the number of Tomcat threads and some detailed JVM memory stuff if I install munin-plugins-java

  • - -
  • I’d still like to get arbitrary mbeans like activeSessions etc, though

  • - -
  • I can’t remember if I had to configure the jmx settings in /etc/munin/plugin-conf.d/munin-node or not—I think all I did was re-run the munin-node-configure script and of course enable JMX in Tomcat’s JVM options
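  • For reference, the connector changes discussed above would look roughly like this in Tomcat’s server.xml (the maxThreads value is only an example, since the default is 200 and I haven’t picked a number yet; the rest of the connector definition is elided):

    <!-- sketch only: raise maxThreads, minSpareThreads to 20, acceptorThreadCount to 2 -->
    <Connector port="8443" address="127.0.0.1" protocol="HTTP/1.1"
               maxThreads="400" minSpareThreads="20" acceptorThreadCount="2"
               connectionTimeout="20000" />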

  • + - -

    2018-01-23

    - +

    2018-01-23

    # zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep "21/Jan/2018" | grep "GET " | grep -c -v "/admin"
     56405
    -
    - -
  • Apparently about 28% of these requests were for bitstreams, 30% for the REST API, and 30% for handles:

    - -
    # zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep "21/Jan/2018" | grep "GET " | grep -v "/admin" | awk '{print $7}' | grep -Eo "^/(handle|bitstream|rest|oai)/" | sort | uniq -c | sort -n
    - 38 /oai/
    -14406 /bitstream/
    -15179 /rest/
    -15191 /handle/
    -
  • - -
  • And 3% were to the homepage or search:

    - -
    # zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep "21/Jan/2018" | grep "GET " | grep -v "/admin" | awk '{print $7}' | grep -Eo '^/($|open-search|discover)' | sort | uniq -c
    -1050 /
    -413 /discover
    -170 /open-search
    -
  • - -
  • The last 10% or so seem to be for static assets that would be served by nginx anyways:

    - -
    # zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep "21/Jan/2018" | grep "GET " | grep -v "/admin" | awk '{print $7}' | grep -v bitstream | grep -Eo '\.(js|css|png|jpg|jpeg|php|svg|gif|txt|map)$' | sort | uniq -c | sort -n
    -  2 .gif
    -  7 .css
    - 84 .js
    -433 .php
    -882 .txt
    -2551 .png
    -
  • - -
  • I can definitely design a test plan based on this!

  • + - -

    2018-01-24

    - +
    # zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep "21/Jan/2018" | grep "GET " | grep -v "/admin" | awk '{print $7}' | grep -Eo "^/(handle|bitstream|rest|oai)/" | sort | uniq -c | sort -n
    +     38 /oai/
    +  14406 /bitstream/
    +  15179 /rest/
    +  15191 /handle/
    +
    +
    # zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep "21/Jan/2018" | grep "GET " | grep -v "/admin" | awk '{print $7}' | grep -Eo '^/($|open-search|discover)' | sort | uniq -c
    +   1050 /
    +    413 /discover
    +    170 /open-search
    +
    +
    # zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep "21/Jan/2018" | grep "GET " | grep -v "/admin" | awk '{print $7}' | grep -v bitstream | grep -Eo '\.(js|css|png|jpg|jpeg|php|svg|gif|txt|map)$' | sort | uniq -c | sort -n
    +      2 .gif
    +      7 .css
    +     84 .js
    +    433 .php
    +    882 .txt
    +   2551 .png
    +
    +

    2018-01-24

    # zcat --force /var/log/nginx/access.log.3.gz /var/log/nginx/access.log.4.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/library-access.log.4.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/rest.log.4.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/oai.log.4.gz /var/log/nginx/error.log.3.gz /var/log/nginx/error.log.4.gz | grep "21/Jan/2018" | grep "GET " | grep -v "/admin" | awk '{print $7}' | grep -E "^/rest" | grep -Eo "(retrieve|expand=[a-z].*)" | sort | uniq -c | sort -n
    -  1 expand=collections
    - 16 expand=all&limit=1
    - 45 expand=items
    -775 retrieve
    -5675 expand=all
    -8633 expand=metadata
    -
    - -
  • I finished creating the test plan for DSpace Test and ran it from my Linode with:

    $ jmeter -n -t DSpacePerfTest-dspacetest.cgiar.org.jmx -l 2018-01-24-1.jtl
    -
  • - -
  • Atmire responded to my issue from two weeks ago and said they will start looking into DSpace 5.8 compatibility for CGSpace

  • - -
  • I set up a new Arch Linux Linode instance with 8192 MB of RAM and ran the test plan a few times to get a baseline:

    - +
    # lscpu
     # lscpu 
     Architecture:        x86_64
    @@ -1416,7 +1187,7 @@ L3 cache:            16384K
     NUMA node0 CPU(s):   0-3
     Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm cpuid_fault invpcid_single pti retpoline fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt arat
     # free -m
    -          total        used        free      shared  buff/cache   available
    +              total        used        free      shared  buff/cache   available
     Mem:           7970         107        7759           1         103        7771
     Swap:           255           0         255
     # pacman -Syu
    @@ -1430,68 +1201,52 @@ $ cd apache-jmeter-3.3/bin
     $ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-24-linode5451120-baseline.jtl -j ~/dspace-performance-test/2018-01-24-linode5451120-baseline.log
     $ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-24-linode5451120-baseline2.jtl -j ~/dspace-performance-test/2018-01-24-linode5451120-baseline2.log
     $ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-24-linode5451120-baseline3.jtl -j ~/dspace-performance-test/2018-01-24-linode5451120-baseline3.log
    -
  • - -
  • Then I generated reports for these runs like this:

    - -
    $ jmeter -g 2018-01-24-linode5451120-baseline.jtl -o 2018-01-24-linode5451120-baseline
    -
  • + - -

    2018-01-25

    - +
    $ jmeter -g 2018-01-24-linode5451120-baseline.jtl -o 2018-01-24-linode5451120-baseline
    +

    2018-01-25

    $ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads.log
     $ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads2.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads2.log
     $ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads3.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads3.log
    -
    - -
  • I changed the parameters back to the baseline ones and switched the Tomcat JVM garbage collector to G1GC and re-ran the tests

  • - -
  • JVM options for Tomcat changed from -Xms3072m -Xmx3072m -XX:+UseConcMarkSweepGC to -Xms3072m -Xmx3072m -XX:+UseG1GC -XX:+PerfDisableSharedMem

    - +
    $ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-g1gc.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-g1gc.log
     $ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-g1gc2.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-g1gc2.log
     $ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-g1gc3.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-g1gc3.log
    -
  • - -
  • I haven’t had time to look at the results yet

  • + - -

    2018-01-26

    - +

    2018-01-26

    - -

    Rights

    - +
    <pair>
    +  <displayed-value>For products published by another party:</displayed-value>
    +  <stored-value></stored-value>
    +</pair>
    +
    +

    Rights

    - -

    2018-01-28

    - +

    2018-01-28

    - -

    2018-01-29

    - +

    2018-01-29

    2018-01-29 05:30:22,226 INFO  org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=3775D4125D28EF0C691B08345D905141:ip_addr=68.180.229.254:view_item:handle=10568/71890
     2018-01-29 05:30:22,322 ERROR org.dspace.app.xmlui.aspect.discovery.AbstractSearch @ org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1994+TO+1999]': Encountered " "]" "] "" at line 1, column 32.
     Was expecting one of:
    -"TO" ...
    -<RANGE_QUOTED> ...
    -<RANGE_GOOP> ...
    +    "TO" ...
    +    <RANGE_QUOTED> ...
    +    <RANGE_GOOP> ...
         
     org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1994+TO+1999]': Encountered " "]" "] "" at line 1, column 32.
     Was expecting one of:
    -"TO" ...
    -<RANGE_QUOTED> ...
    -<RANGE_GOOP> ...
    -
    - -
  • So is this an error caused by this particular client (which happens to be Yahoo! Slurp)?

  • - -
  • I see a few dozen HTTP 499 errors in the nginx access log for a few minutes before this happened, but HTTP 499 is just when nginx says that the client closed the request early

  • - -
  • Perhaps this from the nginx error log is relevant?

    2018/01/29 05:26:34 [warn] 26895#26895: *944759 an upstream response is buffered to a temporary file /var/cache/nginx/proxy_temp/6/16/0000026166 while reading upstream, client: 180.76.15.34, server: cgspace.cgiar.org, request: "GET /bitstream/handle/10947/4658/FISH%20Leaflet.pdf?sequence=12 HTTP/1.1", upstream: "http://127.0.0.1:8443/bitstream/handle/10947/4658/FISH%20Leaflet.pdf?sequence=12", host: "cgspace.cgiar.org"
    -
  • - -
  • I think that must be unrelated, probably the client closed the request to nginx because DSpace (Tomcat) was taking too long

  • - -
  • An interesting snippet to get the maximum and average nginx responses:

    - +
    # awk '($9 ~ /200/) { i++;sum+=$10;max=$10>max?$10:max; } END { printf("Maximum: %d\nAverage: %d\n",max,i?sum/i:0); }' /var/log/nginx/access.log
     Maximum: 2771268
     Average: 210483
    -
  • - -
  • I guess responses that don’t fit in RAM get saved to disk (a default of 1024M), so this is definitely not the issue here, and that warning is totally unrelated

  • - -
  • My best guess is that the Solr search error is related somehow but I can’t figure it out

  • - -
  • We definitely have enough database connections, as I haven’t seen a pool error in weeks:

    - +
    $ grep -c "Timeout: Pool empty." dspace.log.2018-01-2*
     dspace.log.2018-01-20:0
     dspace.log.2018-01-21:0
    @@ -1556,164 +1300,130 @@ dspace.log.2018-01-26:0
     dspace.log.2018-01-27:0
     dspace.log.2018-01-28:0
     dspace.log.2018-01-29:0
    -
  • - -
  • Adam Hunt from WLE complained that pages take “1-2 minutes” to load each, from France and Sri Lanka

  • - -
  • I asked him which particular pages, as right now pages load in 2 or 3 seconds for me

  • - -
  • UptimeRobot said CGSpace went down again, and I looked at PostgreSQL and saw 211 active database connections

  • - -
  • If it’s not memory and it’s not database, it’s gotta be Tomcat threads, seeing as the default maxThreads is 200 anyways, it actually makes sense

  • - -
  • I decided to change the Tomcat thread settings on CGSpace:

    - + +
  • +
  • Looks like I only enabled the new thread stuff on the connector used internally by Solr, so I probably need to match that by increasing them on the other connector that nginx proxies to
  • +
  • Jesus Christ I need to fucking fix the Munin monitoring so that I can tell how many fucking threads I have running
  • +
  • Wow, so apparently you need to specify which connector to check if you want any of the Munin Tomcat plugins besides “tomcat_jvm” to work (the connector name can be seen in the Catalina logs)
  • +
  • I modified /etc/munin/plugin-conf.d/tomcat to add the connector (with surrounding quotes!) and now the other plugins work (obviously the credentials are incorrect):
  • +
    [tomcat_*]
    -env.host 127.0.0.1
    -env.port 8081
    -env.connector "http-bio-127.0.0.1-8443"
    -env.user munin
    -env.password munin
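  • The connector name shows up in catalina.out when Tomcat starts its protocol handlers, so a quick way to find it is something like this (log path from the Ubuntu tomcat7 package):

    # grep 'Starting ProtocolHandler' /var/log/tomcat7/catalina.out | tail -n 2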
    -
    - -
  • For example, I can see the threads:

    # munin-run tomcat_threads
     busy.value 0
     idle.value 20
     max.value 400
    -
  • - -
  • Apparently you can’t monitor more than one connector, so I guess the most important to monitor would be the one that nginx is sending stuff to

  • - -
  • So for now I think I’ll just monitor these and skip trying to configure the jmx plugins

  • - -
  • Although following the logic of /usr/share/munin/plugins/jmx_tomcat_db_pools could be useful for getting the active Tomcat sessions

  • - -
  • From debugging the jmx_tomcat_db_pools script from the munin-plugins-java package, I see that this is how you call arbitrary mbeans:

    - +
    # port=5400 ip="127.0.0.1" /usr/bin/java -cp /usr/share/munin/munin-jmx-plugins.jar org.munin.plugin.jmx.Beans Catalina:type=DataSource,class=javax.sql.DataSource,name=* maxActive
     Catalina:type=DataSource,class=javax.sql.DataSource,name="jdbc/dspace"  maxActive       300
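  • For context, these Beans calls (and the Munin JMX plugins) depend on JMX being enabled in Tomcat’s JVM options; the flags look roughly like this (port 5400 is the one used above, the other values are assumptions):

    # appended to Tomcat's JVM options, for example in /etc/default/tomcat7 (assumed)
    JAVA_OPTS="$JAVA_OPTS -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=5400 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false"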
    -
  • - -
  • More notes here: https://github.com/munin-monitoring/contrib/tree/master/plugins/jmx

  • - -
  • Looking at the Munin graphs, I see that the load is 200% every morning from 03:00 to almost 08:00

  • - -
  • Tomcat’s catalina.out log file is full of spam from this job too, with lines like this:

    - +
    [===================>                               ]38% time remaining: 5 hour(s) 21 minute(s) 47 seconds. timestamp: 2018-01-29 06:25:16
    -
  • - -
  • There are millions of these status lines, for example in just this one log file:

    - +
    # zgrep -c "time remaining" /var/log/tomcat7/catalina.out.1.gz
     1084741
    -
  • - -
  • I filed a ticket with Atmire: https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=566

  • + - -

    2018-01-31

    - +

    2018-01-31

    # munin-run tomcat_threads
     busy.value 400
     idle.value 0
     max.value 400
    -
    - -
  • And wow, we finally exhausted the database connections, from dspace.log:

    - +
    2018-01-31 08:05:28,964 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error - 
     org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-451] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:300; busy:300; idle:0; lastwait:5000].
    -
  • - -
  • Now even the nightly Atmire background thing is getting HTTP 500 error:

    - +
    Jan 31, 2018 8:16:05 AM com.sun.jersey.spi.container.ContainerResponse logException
     SEVERE: Mapped exception to response: 500 (Internal Server Error)
     javax.ws.rs.WebApplicationException
    -
  • - -
  • For now I will restart Tomcat to clear this shit and bring the site back up

  • - -
  • The top IPs from this morning, during 7 and 8AM in XMLUI and REST/OAI:

    - -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E "31/Jan/2018:(07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    - 67 66.249.66.70
    - 70 207.46.13.12
    - 71 197.210.168.174
    - 83 207.46.13.13
    - 85 157.55.39.79
    - 89 207.46.13.14
    -123 68.180.228.157
    -198 66.249.66.90
    -219 41.204.190.40
    -255 2405:204:a208:1e12:132:2a8e:ad28:46c0
    -# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "31/Jan/2018:(07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -  2 65.55.210.187
    -  2 66.249.66.90
    -  3 157.55.39.79
    -  4 197.232.39.92
    -  4 34.216.252.127
    -  6 104.196.152.243
    -  6 213.55.85.89
    - 15 122.52.115.13
    - 16 213.55.107.186
    -596 45.5.184.196
    -
  • - -
  • This looks reasonable to me, so I have no idea why we ran out of Tomcat threads

  • + - -

    Tomcat threads

    - +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E "31/Jan/2018:(07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +     67 66.249.66.70
    +     70 207.46.13.12
    +     71 197.210.168.174
    +     83 207.46.13.13
    +     85 157.55.39.79
    +     89 207.46.13.14
    +    123 68.180.228.157
    +    198 66.249.66.90
    +    219 41.204.190.40
    +    255 2405:204:a208:1e12:132:2a8e:ad28:46c0
    +# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "31/Jan/2018:(07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +      2 65.55.210.187
    +      2 66.249.66.90
    +      3 157.55.39.79
    +      4 197.232.39.92
    +      4 34.216.252.127
    +      6 104.196.152.243
    +      6 213.55.85.89
    +     15 122.52.115.13
    +     16 213.55.107.186
    +    596 45.5.184.196
    +
    +

    Tomcat threads

    - -

    CPU usage week

    - +

    CPU usage week

    # port=5400 ip="127.0.0.1" /usr/bin/java -cp /usr/share/munin/munin-jmx-plugins.jar org.munin.plugin.jmx.Beans Catalina:type=Manager,context=/,host=localhost activeSessions
     Catalina:type=Manager,context=/,host=localhost  activeSessions  8
    -
    - -
  • If you connect to Tomcat in jvisualvm it’s pretty obvious when you hover over the elements

  • + - -

    MBeans in JVisualVM

    +

    MBeans in JVisualVM

    diff --git a/docs/2018-02/index.html b/docs/2018-02/index.html index ddad78e99..9b44ca1b2 100644 --- a/docs/2018-02/index.html +++ b/docs/2018-02/index.html @@ -8,11 +8,10 @@ @@ -23,13 +22,12 @@ I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu’s munin-pl - + @@ -110,162 +108,122 @@ I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu’s munin-pl

    -

    2018-02-01

    - +

    2018-02-01

    - -

    DSpace Sessions

    - +

    DSpace Sessions

    # munin-run jmx_dspace_sessions
     v_.value 223
     v_jspui.value 1
     v_oai.value 0
    -
    - - -

    2018-02-03

    - +

    2018-02-03

    $ ./delete-metadata-values.py -i /tmp/2018-02-03-Affiliations-12-deletions.csv -f cg.contributor.affiliation -m 211 -d dspace -u dspace -p 'fuuu'
     $ ./fix-metadata-values.py -i /tmp/2018-02-03-Affiliations-1116-corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p 'fuuu'
    -
    - -
  • Then I started a full Discovery reindex:

    - +
    $ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
     
     real    96m39.823s
     user    14m10.975s
     sys     2m29.088s
    -
  • - -
  • Generate a new list of affiliations for Peter to sort through:

    - +
    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
     COPY 3723
    -
  • - -
  • Oh, and it looks like we processed over 3.1 million requests in January, up from 2.9 million in December:

    - +
    # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2018"
     3126109
     
     real    0m23.839s
     user    0m27.225s
     sys     0m1.905s
    -
  • - - -

    2018-02-05

    - +

    2018-02-05

    dspace=# update metadatavalue set text_value=REGEXP_REPLACE(text_value, '\s+$' , '') where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^.*?\s+$';
     UPDATE 20
    -
    - -
  • I tried the TRIM(TRAILING from text_value) function and it said it changed 20 items but the spaces didn’t go away

  • - -
  • This is on a fresh import of the CGSpace database, but when I tried to apply it on CGSpace there were no changes detected. Weird.

  • - -
  • Anyways, Peter wants a new list of authors to clean up, so I exported another CSV:

    - +
    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors-2018-02-05.csv with csv;
     COPY 55630
    -
  • - - -

    2018-02-06

    - +

    2018-02-06

    # date
     Tue Feb  6 09:30:32 UTC 2018
     # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "6/Feb/2018:(08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -  2 223.185.41.40
    -  2 66.249.64.14
    -  2 77.246.52.40
    -  4 157.55.39.82
    -  4 193.205.105.8
    -  5 207.46.13.63
    -  5 207.46.13.64
    -  6 154.68.16.34
    -  7 207.46.13.66
    -1548 50.116.102.77
    +      2 223.185.41.40
    +      2 66.249.64.14
    +      2 77.246.52.40
    +      4 157.55.39.82
    +      4 193.205.105.8
    +      5 207.46.13.63
    +      5 207.46.13.64
    +      6 154.68.16.34
    +      7 207.46.13.66
    +   1548 50.116.102.77
     # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E "6/Feb/2018:(08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    - 77 213.55.99.121
    - 86 66.249.64.14
    -101 104.196.152.243
    -103 207.46.13.64
    -118 157.55.39.82
    -133 207.46.13.66
    -136 207.46.13.63
    -156 68.180.228.157
    -295 197.210.168.174
    -752 144.76.64.79
    -
    - -
  • I did notice in /var/log/tomcat7/catalina.out that Atmire’s update thing was running though

  • - -
  • So I restarted Tomcat and now everything is fine

  • - -
  • Next time I see that many database connections I need to save the output so I can analyze it later

  • - -
  • I’m going to re-schedule the taskUpdateSolrStatsMetadata task as Bram detailed in ticket 566 to see if it makes CGSpace stop crashing every morning

  • - -
  • If I move the task from 3AM to 3PM, ideally CGSpace will stop crashing in the morning, or it will start crashing ~12 hours later

  • - -
  • Atmire has said that eventually there will be a fix for the high load caused by their script, but it will come with the 5.8 compatibility work they are already doing

  • - -
  • I re-deployed CGSpace with the new task time of 3PM, ran all system updates, and restarted the server

  • - -
  • Also, I changed the name of the DSpace fallback pool on DSpace Test and CGSpace to be called ‘dspaceCli’ so that I can distinguish it in pg_stat_activity

  • - -
  • I implemented some changes to the pooling in the Ansible infrastructure scripts so that each DSpace web application can use its own pool (web, api, and solr); there’s a rough sketch at the end of this list

  • - -
  • Each pool uses its own name and hopefully this should help me figure out which one is using too many connections next time CGSpace goes down

  • - -
  • Also, this will mean that when a search bot comes along and hammers the XMLUI, the REST and OAI applications will be fine

  • - -
  • I’m not actually sure if the Solr web application uses the database though, so I’ll have to check later and remove it if necessary

  • - -
  • I deployed the changes on DSpace Test only for now, so I will monitor and make them on CGSpace later this week
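  • For reference, separately named pools like this are just distinct JNDI Resource definitions in Tomcat; a minimal sketch, with illustrative names and values rather than the exact Ansible template:

    <!-- one pool per web application; dspaceApi and dspaceCli are defined the same way with their own names and limits -->
    <Resource name="jdbc/dspaceWeb" auth="Container" type="javax.sql.DataSource"
              factory="org.apache.tomcat.jdbc.pool.DataSourceFactory"
              driverClassName="org.postgresql.Driver"
              url="jdbc:postgresql://localhost:5432/dspace"
              username="dspace" password="fuuu"
              maxActive="250" />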


    2018-02-07

    - +

    2018-02-07

    $ psql -c 'select * from pg_stat_activity' > /tmp/pg_stat_activity.txt
     $ grep -c 'PostgreSQL JDBC' /tmp/pg_stat_activity*
     /tmp/pg_stat_activity1.txt:300
    @@ -273,86 +231,71 @@ $ grep -c 'PostgreSQL JDBC' /tmp/pg_stat_activity*
     /tmp/pg_stat_activity3.txt:168
     /tmp/pg_stat_activity4.txt:5
     /tmp/pg_stat_activity5.txt:6
    -
    - -
  • Interestingly, all of those 751 connections were idle!

    - +
    $ grep "PostgreSQL JDBC" /tmp/pg_stat_activity* | grep -c idle
     751
    -
  • - -
  • Since I was restarting Tomcat anyways, I decided to deploy the changes to create two different pools for web and API apps

  • - -
  • Looking at the Munin graphs, I can see that there were almost double the normal number of DSpace sessions at the time of the crash (and also yesterday!):

  • + - -

    DSpace Sessions

    - +

    DSpace Sessions

    $ grep -E '^2018-02-07 (10|11)' dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     1828
    -
    - -
  • CGSpace went down again a few hours later, and now the connections to the dspaceWeb pool are maxed at 250 (the new limit I imposed with the new separate pool scheme)

  • - -
  • What’s interesting is that the DSpace log says the connections are all busy:

    - +
    org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-328] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
    -
  • - -
  • … but in PostgreSQL I see them idle or idle in transaction:

    - +
    $ psql -c 'select * from pg_stat_activity' | grep -c dspaceWeb
     250
     $ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c idle
     250
     $ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c "idle in transaction"
     187
    -
  • - -
  • What the fuck, does DSpace think all connections are busy?

  • - -
  • I suspect these are issues with abandoned connections or maybe a leak, so I’m going to try adding the removeAbandoned='true' parameter which is apparently off by default

  • - -
  • I will try testOnReturn='true' too, just to add more validation, because I’m fucking grasping at straws

  • - -
  • Also, WTF, there was a heap space error randomly in catalina.out:

    - +
    Wed Feb 07 15:01:54 UTC 2018 | Query:containerItem:91917 AND type:2
     Exception in thread "http-bio-127.0.0.1-8081-exec-58" java.lang.OutOfMemoryError: Java heap space
    -
  • - -
  • I’m trying to find a way to determine what was using all those Tomcat sessions, but parsing the DSpace log is hard because some IPs are IPv6, which contain colons!

  • - -
  • Looking at the first crash this morning around 11, I see these IPv4 addresses making requests around 10 and 11AM:

    - +
    $ grep -E '^2018-02-07 (10|11)' dspace.log.2018-02-07 | grep -o -E 'ip_addr=[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' | sort -n | uniq -c | sort -n | tail -n 20
    - 34 ip_addr=46.229.168.67
    - 34 ip_addr=46.229.168.73
    - 37 ip_addr=46.229.168.76
    - 40 ip_addr=34.232.65.41
    - 41 ip_addr=46.229.168.71
    - 44 ip_addr=197.210.168.174
    - 55 ip_addr=181.137.2.214
    - 55 ip_addr=213.55.99.121
    - 58 ip_addr=46.229.168.65
    - 64 ip_addr=66.249.66.91
    - 67 ip_addr=66.249.66.90
    - 71 ip_addr=207.46.13.54
    - 78 ip_addr=130.82.1.40
    -104 ip_addr=40.77.167.36
    -151 ip_addr=68.180.228.157
    -174 ip_addr=207.46.13.135
    -194 ip_addr=54.83.138.123
    -198 ip_addr=40.77.167.62
    -210 ip_addr=207.46.13.71
    -214 ip_addr=104.196.152.243
    -
  • - -
  • These IPs made thousands of sessions today:

    $ grep 104.196.152.243 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     530
     $ grep 207.46.13.71 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
    @@ -374,223 +317,165 @@ $ grep 207.46.13.54 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}'
     $ grep 46.229.168 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     992
     
    -
  • - -
  • Let’s investigate who these IPs belong to:

    - + +
  • +
  • Nice, so these are all known bots that are already crammed into one session by Tomcat's Crawler Session Manager Valve.
  • +
  • What in the actual fuck, why is our load doing this? It's gotta be something fucked up with the database pool being “busy” but everything is fucking idle
  • +
  • One that I should probably add in nginx is 54.83.138.123, which is apparently the following user agent:
  • +
    BUbiNG (+http://law.di.unimi.it/BUbiNG.html)
    -
    - -
  • This one makes two thousand requests per day or so recently:

    - +
    # grep -c BUbiNG /var/log/nginx/access.log /var/log/nginx/access.log.1
     /var/log/nginx/access.log:1925
     /var/log/nginx/access.log.1:2029
    -
  • - -
  • And they have 30 IPs, so fuck that shit I’m going to add them to the Tomcat Crawler Session Manager Valve nowwww (there’s a sketch of the valve at the end of this list)

  • - -
  • Lots of discussions on the dspace-tech mailing list over the last few years about leaky transactions being a known problem with DSpace

  • - -
  • Helix84 recommends restarting PostgreSQL instead of Tomcat because it restarts quicker

  • - -
  • This is how the connections looked when it crashed this afternoon:

    - -
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
    -  5 dspaceApi
    -290 dspaceWeb
    -
  • - -
  • This is how it is right now:

    - -
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
    -  5 dspaceApi
    -  5 dspaceWeb
    -
  • - -
  • So is this just some fucked up XMLUI database leaking?

  • - -
  • I notice there is an issue (that I’ve probably noticed before) on the Jira tracker about this that was fixed in DSpace 5.7: https://jira.duraspace.org/browse/DS-3551

  • - -
  • I seriously doubt this leaking shit is fixed for sure, but I’m gonna cherry-pick all those commits and try them on DSpace Test and probably even CGSpace because I’m fed up with this shit

  • - -
  • I cherry-picked all the commits for DS-3551 but it won’t build on our current DSpace 5.5!

  • - -
  • I sent a message to the dspace-tech mailing list asking why DSpace thinks these connections are busy when PostgreSQL says they are idle
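  • For reference, the Crawler Session Manager Valve change mentioned above is configured in server.xml; extending the default crawlerUserAgents pattern to also match BUbiNG would look roughly like this (the default part of the regex is quoted from memory, so double-check it against the Tomcat docs):

    <Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
           crawlerUserAgents=".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*BUbiNG.*" />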

  • + - -

    2018-02-10

    - +
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
    +      5 dspaceApi
    +    290 dspaceWeb
    +
    +
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
    +      5 dspaceApi
    +      5 dspaceWeb
    +
    +

    2018-02-10

    Field dc_contributor_author has choice presentation of type "select", it may NOT be authority-controlled.
    -
    - -
  • If I change choices.presentation to suggest it gives this error:

    - +
    xmlui.mirage2.forms.instancedCompositeFields.noSuggestionError
    -
  • - -
  • So I don’t think we can disable the ORCID lookup function and keep the ORCID badges
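  • For context, this behavior is driven by the authority and choice properties for the field in dspace.cfg; ours look roughly like this (a sketch from memory, not copied from the config):

    choices.plugin.dc.contributor.author = SolrAuthorAuthority
    choices.presentation.dc.contributor.author = lookup
    authority.controlled.dc.contributor.author = true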

  • + - -

    2018-02-11

    - +

    2018-02-11

    - -

    Weird thumbnail

    - +

    Weird thumbnail

    $ convert CCAFS_WP_223.pdf\[0\] -profile /usr/local/share/ghostscript/9.22/iccprofiles/default_cmyk.icc -thumbnail 600x600 -flatten -profile /usr/local/share/ghostscript/9.22/iccprofiles/default_rgb.icc CCAFS_WP_223.jpg
    -
    - - -

    Manual thumbnail

    - +

    Manual thumbnail

    $ isutf8 authors-2018-02-05.csv
     authors-2018-02-05.csv: line 100, char 18, byte 4179: After a first byte between E1 and EC, expecting the 2nd byte between 80 and BF.
    -
    - -
  • The isutf8 program comes from moreutils

  • - -
  • Line 100 contains: Galiè, Alessandra

  • - -
  • In other news, psycopg2 is splitting their package in pip, so to install the binary wheel distribution you need to use pip install psycopg2-binary

  • - -
  • See: http://initd.org/psycopg/articles/2018/02/08/psycopg-274-released/

  • - -
  • I updated my fix-metadata-values.py and delete-metadata-values.py scripts on the scripts page: https://github.com/ilri/DSpace/wiki/Scripts

  • - -
  • I ran the 342 author corrections (after trimming whitespace and excluding those with || and other syntax errors) on CGSpace:

    - +
    $ ./fix-metadata-values.py -i Correct-342-Authors-2018-02-11.csv -f dc.contributor.author -t correct -m 3 -d dspace -u dspace -p 'fuuu'
    -
  • - -
  • Then I ran a full Discovery re-indexing:

    - +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
     $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
    -
  • - -
  • That reminds me that Bizu had asked me to fix some of Alan Duncan’s names in December

  • - -
  • I see he actually has some variations with “Duncan, Alan J.”: https://cgspace.cgiar.org/discover?filtertype_1=author&filter_relational_operator_1=contains&filter_1=Duncan%2C+Alan&submit_apply_filter=&query=

  • - -
  • I will just update those for her too and then restart the indexing:

    - +
    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Duncan, Alan%';
    -text_value    |              authority               | confidence 
    +   text_value    |              authority               | confidence 
     -----------------+--------------------------------------+------------
    -Duncan, Alan J. | 5ff35043-942e-4d0a-b377-4daed6e3c1a3 |        600
    -Duncan, Alan J. | 62298c84-4d9d-4b83-a932-4a9dd4046db7 |         -1
    -Duncan, Alan J. |                                      |         -1
    -Duncan, Alan    | a6486522-b08a-4f7a-84f9-3a73ce56034d |        600
    -Duncan, Alan J. | cd0e03bf-92c3-475f-9589-60c5b042ea60 |         -1
    -Duncan, Alan J. | a6486522-b08a-4f7a-84f9-3a73ce56034d |         -1
    -Duncan, Alan J. | 5ff35043-942e-4d0a-b377-4daed6e3c1a3 |         -1
    -Duncan, Alan J. | a6486522-b08a-4f7a-84f9-3a73ce56034d |        600
    + Duncan, Alan J. | 5ff35043-942e-4d0a-b377-4daed6e3c1a3 |        600
    + Duncan, Alan J. | 62298c84-4d9d-4b83-a932-4a9dd4046db7 |         -1
    + Duncan, Alan J. |                                      |         -1
    + Duncan, Alan    | a6486522-b08a-4f7a-84f9-3a73ce56034d |        600
    + Duncan, Alan J. | cd0e03bf-92c3-475f-9589-60c5b042ea60 |         -1
    + Duncan, Alan J. | a6486522-b08a-4f7a-84f9-3a73ce56034d |         -1
    + Duncan, Alan J. | 5ff35043-942e-4d0a-b377-4daed6e3c1a3 |         -1
    + Duncan, Alan J. | a6486522-b08a-4f7a-84f9-3a73ce56034d |        600
     (8 rows)
     
     dspace=# begin;
     dspace=# update metadatavalue set text_value='Duncan, Alan', authority='a6486522-b08a-4f7a-84f9-3a73ce56034d', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Duncan, Alan%';
     UPDATE 216
     dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Duncan, Alan%';
    -text_value  |              authority               | confidence 
    +  text_value  |              authority               | confidence 
     --------------+--------------------------------------+------------
    -Duncan, Alan | a6486522-b08a-4f7a-84f9-3a73ce56034d |        600
    + Duncan, Alan | a6486522-b08a-4f7a-84f9-3a73ce56034d |        600
     (1 row)
     dspace=# commit;
    -
  • - -
  • Run all system updates on DSpace Test (linode02) and reboot it

  • - -
  • I wrote a Python script (resolve-orcids-from-solr.py) using SolrClient to parse the Solr authority cache for ORCID IDs

  • - -
  • We currently have 1562 authority records with ORCID IDs, and 624 unique IDs

  • - -
  • We can use this to build a controlled vocabulary of ORCID IDs for new item submissions

  • - -
  • I don’t know how to add ORCID IDs to existing items yet… some more querying of PostgreSQL for authority values perhaps?

  • - -
  • I added the script to the ILRI DSpace wiki on GitHub

  • + - -

    2018-02-12

    - +

    2018-02-12

    - -

    Atmire Workflow Statistics No Data Available

    - +

    Atmire Workflow Statistics No Data Available

    - -

    2018-02-13

    - +
    dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^Submitted.*yabowork.*2017-12.*';
    +
    +

    2018-02-13

    2018-02-13 12:50:13,656 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error - 
     org.postgresql.util.PSQLException: An I/O error occurred while sending to the backend.
     ...
     Caused by: java.net.SocketException: Socket closed
    -
    - -
  • Could be because of the removeAbandoned="true" that I enabled in the JDBC connection pool last week?

    - +
    $ grep -c "java.net.SocketException: Socket closed" dspace.log.2018-02-*
     dspace.log.2018-02-01:0
     dspace.log.2018-02-02:0
    @@ -605,301 +490,236 @@ dspace.log.2018-02-10:0
     dspace.log.2018-02-11:3
     dspace.log.2018-02-12:0
     dspace.log.2018-02-13:4
    -
  • - -
  • I apparently added that on 2018-02-07 so it could be, as I don’t see any of those socket closed errors in 2018-01’s logs!

  • - -
  • I will increase the removeAbandonedTimeout from its default of 60 to 90 and enable logAbandoned

  • - -
  • Peter hit this issue one more time, and this is apparently what Tomcat’s catalina.out log says when an abandoned connection is removed:

    - +
    Feb 13, 2018 2:05:42 PM org.apache.tomcat.jdbc.pool.ConnectionPool abandon
     WARNING: Connection has been abandoned PooledConnection[org.postgresql.jdbc.PgConnection@22e107be]:java.lang.Exception
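  • For reference, the abandoned-connection settings mentioned above are attributes on the JDBC pool definition itself; the relevant fragment looks roughly like this (values from these notes, everything else elided):

    removeAbandoned="true"
    removeAbandonedTimeout="90"
    logAbandoned="true"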
    -
  • - - -

    2018-02-14

    - +

    2018-02-14

    +
  • Atmire responded on the DSpace 5.8 compatibility ticket and said they will let me know if they want me to give them a clean 5.8 branch
  • - -
  • I formatted my list of ORCID IDs as a controlled vocabulary, sorted alphabetically, then ran through XML tidy:

    - +
  • I formatted my list of ORCID IDs as a controlled vocabulary, sorted alphabetically, then ran through XML tidy:
  • +
    $ sort cgspace-orcids.txt > dspace/config/controlled-vocabularies/cg-creator-id.xml
     $ add XML formatting...
     $ tidy -xml -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
    -
    - -
  • It seems the tidy fucks up accents, for example it mangles the “ñ” in Adriana Tofiño (0000-0001-7115-7169)

  • - -
  • We need to force UTF-8:

    - +
    $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
    -
  • - -
  • This preserves special accent characters

  • - -
  • I tested the display and store of these in the XMLUI and PostgreSQL and it looks good

  • - -
  • Sisay exported all ILRI, CIAT, etc authors from ORCID and sent a list of 600+

  • - -
  • Peter combined it with mine and we have 1204 unique ORCIDs!

    - +
    $ grep -coE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' CGcenter_ORCID_ID_combined.csv
     1204
     $ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' CGcenter_ORCID_ID_combined.csv | sort | uniq | wc -l
     1204
    -
  • - -
  • Also, save that regex for the future because it will be very useful!

  • - -
  • CIAT sent a list of their authors’ ORCIDs and combined with ours there are now 1227:

    - +
    $ cat CGcenter_ORCID_ID_combined.csv ciat-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
     1227
    -
  • - -
  • There are some formatting issues with names in Peter’s list, so I should remember to re-generate the list of names from ORCID’s API once we’re done

  • - -
  • The dspace cleanup -v currently fails on CGSpace with the following:

    - -
    - Deleting bitstream record from database (ID: 149473)
    +
    +
     - Deleting bitstream record from database (ID: 149473)
     Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
    -Detail: Key (bitstream_id)=(149473) is still referenced from table "bundle".
    -
  • - -
  • The solution is to update the bitstream table, as I’ve discovered several other times in 2016 and 2017:

    $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (149473);'
     UPDATE 1
    -
  • - -
  • Then the cleanup process will continue for a while and hit another foreign key conflict, and eventually it will complete after you manually resolve them all
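  • One way to semi-automate that, as a rough and untested sketch: re-run cleanup, pull the blocked bitstream ID out of the error, and null the reference each time (this only handles the bundle primary bitstream case described above):

    $ while true; do
        # grab the bitstream ID from the "Key (bitstream_id)=(NNN)" detail, if any
        id=$(/home/cgspace.cgiar.org/bin/dspace cleanup -v 2>&1 | grep -oE 'Key \(bitstream_id\)=\([0-9]+\)' | grep -oE '[0-9]+' | head -n1)
        # no blocked ID means cleanup finished (or failed for some other reason), so stop
        [ -z "$id" ] && break
        psql dspace -c "update bundle set primary_bitstream_id=NULL where primary_bitstream_id=$id;"
      done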

  • + - -

    2018-02-15

    - +

    2018-02-15

    + +
  • And this item doesn't even exist on CGSpace!
  • Start working on XMLUI item display code for ORCIDs
  • Send emails to Macaroni Bros and Usman at CIFOR about ORCID metadata
  • CGSpace crashed while I was driving to Tel Aviv, and was down for four hours!
  • I only looked quickly in the logs but saw a bunch of database errors
  • - -
  • PostgreSQL connections are currently:

    - +
  • PostgreSQL connections are currently:
  • +
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | uniq -c
    -  2 dspaceApi
    -  1 dspaceWeb
    -  3 dspaceApi
    -
    - -
  • I see shitloads of memory errors in Tomcat’s logs:

    # grep -c "Java heap space" /var/log/tomcat7/catalina.out
     56
    -
  • - -
  • And shit tons of database connections abandoned:

    - +
    # grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' /var/log/tomcat7/catalina.out
     612
    -
  • - -
  • I have no fucking idea why it crashed

  • - -
  • The XMLUI activity looks like:

    - -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E "15/Feb/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -715 63.143.42.244
    -746 213.55.99.121
    -886 68.180.228.157
    -967 66.249.66.90
    -1013 216.244.66.245
    -1177 197.210.168.174
    -1419 207.46.13.159
    -1512 207.46.13.59
    -1554 207.46.13.157
    -2018 104.196.152.243
    -
  • + - -

    2018-02-17

    - +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E "15/Feb/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +    715 63.143.42.244
    +    746 213.55.99.121
    +    886 68.180.228.157
    +    967 66.249.66.90
    +   1013 216.244.66.245
    +   1177 197.210.168.174
    +   1419 207.46.13.159
    +   1512 207.46.13.59
    +   1554 207.46.13.157
    +   2018 104.196.152.243
    +

    2018-02-17

    dspace=# update metadatavalue set text_value='United States Agency for International Development' where resource_type_id=2 and metadata_field_id=29 and text_value like '%U.S. Agency for International Development%';
     UPDATE 2
    -
    - - -

    2018-02-18

    - +

    2018-02-18

    - -

    Displaying ORCID iDs in XMLUI

    - +

    Displaying ORCID iDs in XMLUI

    $ wc -l dspace.log.2018-02-1{0..8}
    -383483 dspace.log.2018-02-10
    -275022 dspace.log.2018-02-11
    -249557 dspace.log.2018-02-12
    -280142 dspace.log.2018-02-13
    -615119 dspace.log.2018-02-14
    -4388259 dspace.log.2018-02-15
    -243496 dspace.log.2018-02-16
    -209186 dspace.log.2018-02-17
    -167432 dspace.log.2018-02-18
    -
    - -
  • From an average of a few hundred thousand to over four million lines in DSpace log?

  • - -
  • Using grep’s -B1 I can see the line before the heap space error, which has the time, ie:

    2018-02-15 16:02:12,748 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
     org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.OutOfMemoryError: Java heap space
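  • A grep along these lines shows that context (log file name assumed from the dates above):

    $ grep -B1 'java.lang.OutOfMemoryError: Java heap space' dspace.log.2018-02-15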
    -
  • - -
  • So these errors happened at hours 16, 18, 19, and 20

  • - -
  • Let’s see what was going on in nginx then:

    - +
    # zcat --force /var/log/nginx/*.log.{3,4}.gz | wc -l
     168571
     # zcat --force /var/log/nginx/*.log.{3,4}.gz | grep -E "15/Feb/2018:(16|18|19|20)" | wc -l
     8188
    -
  • - -
  • Only 8,000 requests during those four hours, out of 170,000 the whole day!

  • - -
  • And the usage of XMLUI, REST, and OAI looks SUPER boring:

    - -
    # zcat --force /var/log/nginx/*.log.{3,4}.gz | grep -E "15/Feb/2018:(16|18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -111 95.108.181.88
    -158 45.5.184.221
    -201 104.196.152.243
    -205 68.180.228.157
    -236 40.77.167.131 
    -253 207.46.13.159
    -293 207.46.13.59
    -296 63.143.42.242
    -303 207.46.13.157
    -416 63.143.42.244
    -
  • - -
  • 63.143.42.244 is Uptime Robot, and 207.46.x.x is Bing!

  • - -
  • The DSpace sessions, PostgreSQL connections, and JVM memory all look normal

  • - -
  • I see a lot of AccessShareLock on February 15th…?

  • + - -

    PostgreSQL locks

    - +
    # zcat --force /var/log/nginx/*.log.{3,4}.gz | grep -E "15/Feb/2018:(16|18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +    111 95.108.181.88
    +    158 45.5.184.221
    +    201 104.196.152.243
    +    205 68.180.228.157
    +    236 40.77.167.131 
    +    253 207.46.13.159
    +    293 207.46.13.59
    +    296 63.143.42.242
    +    303 207.46.13.157
    +    416 63.143.42.244
    +
    +

    PostgreSQL locks

    - -

    2018-02-19

    - +

    2018-02-19

    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml ORCID_ID_CIAT_IITA_IWMI-csv.csv CGcenter_ORCID_ID_combined.csv | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l  
     1571
    -
    - -
  • I updated my resolve-orcids-from-solr.py script to be able to resolve ORCID identifiers from a text file so I renamed it to resolve-orcids.py

  • - -
  • Also, I updated it so it uses several new options:

    - +
    $ ./resolve-orcids.py -i input.txt -o output.txt
     $ cat output.txt 
     Ali Ramadhan: 0000-0001-5019-1368
     Ahmad Maryudi: 0000-0001-5051-7217
    -
  • - -
  • I was running this on the new list of 1571 and found an error:

    - +
    Looking up the name associated with ORCID iD: 0000-0001-9634-1958
     Traceback (most recent call last):
    -File "./resolve-orcids.py", line 111, in <module>
    -read_identifiers_from_file()
    -File "./resolve-orcids.py", line 37, in read_identifiers_from_file
    -resolve_orcid_identifiers(orcids)
    -File "./resolve-orcids.py", line 65, in resolve_orcid_identifiers
    -family_name = data['name']['family-name']['value']
    +  File "./resolve-orcids.py", line 111, in <module>
    +    read_identifiers_from_file()
    +  File "./resolve-orcids.py", line 37, in read_identifiers_from_file
    +    resolve_orcid_identifiers(orcids)
    +  File "./resolve-orcids.py", line 65, in resolve_orcid_identifiers
    +    family_name = data['name']['family-name']['value']
     TypeError: 'NoneType' object is not subscriptable
    -
  • According to ORCID that identifier’s family-name is null so that sucks
  • I fixed the script so that it checks if the family name is null
  • Now another:

    Looking up the name associated with ORCID iD: 0000-0002-1300-3636
    Traceback (most recent call last):
      File "./resolve-orcids.py", line 117, in <module>
        read_identifiers_from_file()
      File "./resolve-orcids.py", line 37, in read_identifiers_from_file
        resolve_orcid_identifiers(orcids)
      File "./resolve-orcids.py", line 65, in resolve_orcid_identifiers
        if data['name']['given-names']:
    TypeError: 'NoneType' object is not subscriptable
  • According to ORCID that identifier’s entire name block is null!

    2018-02-20

    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml ORCID_ID_CIAT_IITA_IWMI.csv | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > 2018-02-20-combined.txt

  • I updated resolve-orcids.py to use the “credit-name” if it exists in a profile, falling back to “given-names” + “family-name” (a minimal sketch of this logic follows the test file below)
  • Also, I added color-coded output to the debug messages and added a “quiet” mode that suppresses the normal behavior of printing results to the screen
  • I’m using this as the test input for resolve-orcids.py:

    $ cat orcid-test-values.txt 
     # valid identifier with 'given-names' and 'family-name'
     0000-0001-5019-1368
     
     # missing ORCID identifier
     0000-0003-4221-3214
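  • For reference, here is a minimal sketch of that null-safe name lookup; it assumes the ORCID public API v2.0 “person” endpoint and its JSON layout, and is only an illustration, not the real resolve-orcids.py:

    #!/usr/bin/env python3
    # Minimal sketch of null-safe ORCID name resolution; not the real resolve-orcids.py.
    # Assumes the ORCID public API v2.0 "person" endpoint and its JSON field names.
    import requests

    def resolve_name(orcid):
        response = requests.get(
            f"https://pub.orcid.org/v2.0/{orcid}/person",
            headers={"Accept": "application/json"},
        )
        response.raise_for_status()
        name = response.json().get("name")

        # Some profiles have a completely null name block (private or deactivated)
        if name is None:
            return None

        # Prefer the credit name if the researcher has set one
        credit_name = name.get("credit-name")
        if credit_name and credit_name.get("value"):
            return credit_name["value"]

        # Otherwise fall back to given-names + family-name, either of which may be null
        given = (name.get("given-names") or {}).get("value", "")
        family = (name.get("family-name") or {}).get("value", "")
        return " ".join(part for part in (given, family) if part) or None

    if __name__ == "__main__":
        for orcid in ["0000-0001-5019-1368", "0000-0001-9634-1958"]:
            print(f"{resolve_name(orcid)}: {orcid}")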
  • Help debug issues with Altmetric badges again, it looks like Altmetric is all kinds of fucked up

  • - -
  • Last week I pointed out that they were tracking Handles from our test server

  • - -
  • Now, their API is responding with content that is marked as content-type JSON but is not valid JSON

  • - -
  • For example, this item: https://cgspace.cgiar.org/handle/10568/83320

  • - -
  • The Altmetric JavaScript builds the following API call: https://api.altmetric.com/v1/handle/10568/83320?callback=_altmetric.embed_callback&domain=cgspace.cgiar.org&key=3c130976ca2b8f2e88f8377633751ba1&cache_until=13-20

  • - -
  • The response body is not JSON

  • - -
  • To contrast, the following bare API call without query parameters is valid JSON: https://api.altmetric.com/v1/handle/10568/83320

  • - -
  • I told them that it’s their JavaScript that is fucked up

  • - -
  • Remove CPWF project number and Humidtropics subject from submission form (#3)

  • - -
  • I accidentally merged it into my own repository, oops

    2018-02-22

    # cat /var/log/nginx/*.log | grep -E "22/Feb/2018:13" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         55 192.99.39.235
         60 207.46.13.26
         62 40.77.167.38
         65 207.46.13.23
        103 41.57.108.208
        120 104.196.152.243
        133 104.154.216.0
        145 68.180.228.117
        159 54.92.197.82
        231 5.9.6.51

  • Otherwise there was pretty normal traffic the rest of the day:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "22/Feb/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
        839 216.244.66.245
       1074 68.180.228.117
       1114 157.55.39.100
       1162 207.46.13.26
       1178 207.46.13.23
       2749 104.196.152.243
       3109 50.116.102.77
       4199 70.32.83.92
       5208 5.9.6.51
       8686 45.5.184.196

  • So I don’t see any definite cause for this crash, but I do see a shit ton of abandoned PostgreSQL connections today around 1PM!

    # grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' /var/log/tomcat7/catalina.out
     729
     # grep 'Feb 22, 2018 1' /var/log/tomcat7/catalina.out | grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' 
     519
  • I think the removeAbandonedTimeout might still be too low (I increased it from 60 to 90 seconds last week)
  • Abandoned connections are not a cause but a symptom, though perhaps setting the timeout to a few minutes would be better?
  • Also, while looking at the logs I see some new bot:

    Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.4.2661.102 Safari/537.36; 360Spider
  • It seems to re-use its user agent but makes tons of useless requests, and I wonder if I should add “.spider.” to the Tomcat Crawler Session Manager valve?

    2018-02-23

    2018-02-25

    $ cat dspace/config/controlled-vocabularies/cg-creator-id.xml | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l          
     988
  • After adding the ones from CCAFS we now have 1004:

    $ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/ccafs | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
     1004
  • I will add them to DSpace Test, but Abenet says she’s still waiting to send us ILRI’s list
  • I will tell her that we should proceed with sharing our work on DSpace Test with the partners this week anyway, and we can update the list later
  • While regenerating the names for these ORCID identifiers I saw one that has a weird value for its names:

    Looking up the names associated with ORCID iD: 0000-0002-2614-426X
     Given Names Deactivated Family Name Deactivated: 0000-0002-2614-426X
    -
  • - -
  • I don’t know if the user accidentally entered this as their name or if that’s how ORCID behaves when the name is private?

  • - -
  • I will remove that one from our list for now

  • - -
  • Remove Dryland Systems subject from submission form because that CRP closed two years ago (#355)

  • - -
  • Run all system updates on DSpace Test

  • - -
  • Email ICT to ask how to proceed with the OCS proforma issue for the new DSpace Test server on Linode

  • - -
  • Thinking about how to preserve ORCID identifiers attached to existing items in CGSpace

  • - -
  • We have over 60,000 unique author + authority combinations on CGSpace:

    - +
    dspace=# select count(distinct (text_value, authority)) from metadatavalue where resource_type_id=2 and metadata_field_id=3;
 count 
-------
 62464
     (1 row)

  • I know from earlier this month that there are only 624 unique ORCID identifiers in the Solr authority core, so it’s way easier to just fetch the unique ORCID iDs from Solr and then go back to PostgreSQL and do the metadata mapping that way
  • The query in Solr would simply be orcid_id:*
  • Assuming I know an authority record’s id, for example d7ef744b-bbd4-4171-b449-00e37e1b776f, I could query PostgreSQL for all metadata records using that authority:

    dspace=# select * from metadatavalue where resource_type_id=2 and authority='d7ef744b-bbd4-4171-b449-00e37e1b776f';
 metadata_value_id | resource_id | metadata_field_id |        text_value         | text_lang | place |              authority               | confidence | resource_type_id 
-------------------+-------------+-------------------+---------------------------+-----------+-------+--------------------------------------+------------+------------------
           2726830 |       77710 |                 3 | Rodríguez Chalarca, Jairo |           |     2 | d7ef744b-bbd4-4171-b449-00e37e1b776f |        600 |                2
     (1 row)

  • Then I suppose I can use the resource_id to identify the item?
  • Actually, resource_id is the same id we use in CSV, so I could simply build something like this for a metadata import (see the sketch after the CSV example below)!

     93848,Alan S. Orth: 0000-0002-1735-7458||Peter G. Ballantyne: 0000-0001-9346-2893
    -
  • - -
  • I just discovered that requests-cache can transparently cache HTTP requests

  • - -
  • Running resolve-orcids.py with my test input takes 10.5 seconds the first time, and then 3.0 seconds the second time!

    - +
    $ time ./resolve-orcids.py -i orcid-test-values.txt -o /tmp/orcid-names
     Ali Ramadhan: 0000-0001-5019-1368
     Alan S. Orth: 0000-0002-1735-7458
     Ibrahim Mohammed: 0000-0001-5199-5528
     Nor Azwadi: 0000-0001-9634-1958
     ./resolve-orcids.py -i orcid-test-values.txt -o /tmp/orcid-names  0.23s user 0.05s system 8% cpu 3.046 total
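  • Wiring requests-cache in is basically a one-liner; a small sketch (the cache name and expiry here are arbitrary choices, not necessarily what resolve-orcids.py uses):

    #!/usr/bin/env python3
    # Sketch of transparent HTTP caching with requests-cache; cache name and expiry are arbitrary.
    import requests
    import requests_cache

    # All subsequent requests.get() calls read and write a local SQLite cache
    requests_cache.install_cache("orcid-cache", expire_after=86400)

    for orcid in ["0000-0001-5019-1368", "0000-0002-1735-7458"]:
        response = requests.get(
            f"https://pub.orcid.org/v2.0/{orcid}/person",
            headers={"Accept": "application/json"},
        )
        # from_cache is True on the second run, when the response comes from the cache
        print(orcid, getattr(response, "from_cache", False))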

    2018-02-26

    2018-02-27

    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
      5 dspaceApi
    279 dspaceWeb
     $ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c "idle in transaction"
     218
  • So I’m re-enabling the removeAbandoned setting
  • I grabbed a snapshot of the active connections in pg_stat_activity for all queries running longer than 2 minutes:

    dspace=# \copy (SELECT now() - query_start as "runtime", application_name, usename, datname, waiting, state, query
  FROM  pg_stat_activity
  WHERE now() - query_start > '2 minutes'::interval
 ORDER BY runtime DESC) to /tmp/2018-02-27-postgresql.txt
     COPY 263
  • 100 of these idle in transaction connections are the following query:

    SELECT * FROM resourcepolicy WHERE resource_type_id= $1 AND resource_id= $2 AND action_id= $3
  • … but according to the pg_locks documentation I should have done this to correlate the locks with the activity:

    SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;
  • Tom Desair from Atmire shared some extra JDBC pool parameters that might be useful on my thread on the dspace-tech mailing list:
  • I will try with abandonWhenPercentageFull='50'
  • Also there are some indexes proposed in DS-3636 that he urged me to try
  • Finally finished the orcid-authority-to-item.py script!
  • It successfully mapped 2600 ORCID identifiers to items in my tests
  • I will run it on DSpace Test

    2018-02-28

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "28/Feb/2018:09:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
     65 197.210.168.174
     74 213.55.99.121
     74 66.249.66.90
     86 41.204.190.40
    102 130.225.98.207
    108 192.0.89.192
    112 157.55.39.218
    129 207.46.13.21
    131 207.46.13.115
    135 207.46.13.101

  • Looking in dspace.log.2018-02-28 I see this, though:

    2018-02-28 09:19:29,692 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
     org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.OutOfMemoryError: Java heap space
  • Memory issues seem to be common this month:

    $ grep -c 'nested exception is java.lang.OutOfMemoryError: Java heap space' dspace.log.2018-02-* 
     dspace.log.2018-02-01:0
     dspace.log.2018-02-02:0
dspace.log.2018-02-25:0
     dspace.log.2018-02-26:0
     dspace.log.2018-02-27:6
     dspace.log.2018-02-28:1
  • Top ten users by session during the first twenty minutes of 9AM:

    $ grep -E '2018-02-28 09:(0|1)' dspace.log.2018-02-28 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq -c | sort -n | tail -n 10
     18 session_id=F2DFF64D3D707CD66AE3A873CEC80C49
     19 session_id=92E61C64A79F0812BE62A3882DA8F4BA
     21 session_id=57417F5CB2F9E3871E609CEEBF4E001F
     25 session_id=C3CD265AB7AA51A49606C57C069A902A
     26 session_id=E395549F081BA3D7A80F174AE6528750
     26 session_id=FEE38CF9760E787754E4480069F11CEC
     33 session_id=C45C2359AE5CD115FABE997179E35257
     38 session_id=1E9834E918A550C5CD480076BC1B73A4
     40 session_id=8100883DAD00666A655AE8EC571C95AE
     66 session_id=01D9932D6E85E90C2BA9FF5563A76D03

  • According to the log 01D9932D6E85E90C2BA9FF5563A76D03 is an ILRI editor, doing lots of updating and editing of items
  • 8100883DAD00666A655AE8EC571C95AE is some Indian IP address
  • 1E9834E918A550C5CD480076BC1B73A4 looks to be a session shared by the bots
  • So maybe it was due to the editor’s uploading of files, perhaps something that was too big?
  • I think I’ll increase the JVM heap size on CGSpace from 6144m to 8192m because I’m sick of this random crashing shit, the server has the memory, and I’d rather eliminate this so I can get back to solving PostgreSQL issues and doing other real work
  • Run the few corrections from earlier this month for sponsor on CGSpace:

    cgspace=# update metadatavalue set text_value='United States Agency for International Development' where resource_type_id=2 and metadata_field_id=29 and text_value like '%U.S. Agency for International Development%';
     UPDATE 3
    -
  • - -
  • I finally got a CGIAR account so I logged into CGSpace with it and tried to delete my old unfinished submissions (22 of them)

  • - -
  • Eventually it succeeded, but it took about five minutes and I noticed LOTS of locks happening with this query:

    - +
    dspace=# \copy (SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid) to /tmp/locks-aorth.txt;
    -
  • - -
  • I took a few snapshots during the process and noticed 500, 800, and even 2000 locks at certain times during the process

  • - -
  • Afterwards I looked a few times and saw only 150 or 200 locks

  • - -
  • On the test server, with the PostgreSQL indexes from DS-3636 applied, it finished instantly

  • - -
  • Run system updates on DSpace Test and reboot the server


diff --git a/docs/2018-03/index.html b/docs/2018-03/index.html

    2018-03-02


  • Export a CSV of the IITA community metadata for Martin Mueller

    2018-03-06

    $ ./fix-metadata-values.py -i Correct-309-authors-2018-03-06.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3      
 $ ./delete-metadata-values.py -i Delete-3-Authors-2018-03-06.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3
    -
    - -
  • This time there were no errors in whitespace but I did have to correct one incorrectly encoded accent character

  • - -
  • Add new CRP subject “GRAIN LEGUMES AND DRYLAND CEREALS” to input-forms.xml (#358)

  • - -
  • Merge the ORCID integration stuff in to 5_x-prod for deployment on CGSpace soon (#359)

  • - -
  • Deploy ORCID changes on CGSpace (linode18), run all system updates, and reboot the server

  • - -
  • Run all system updates on DSpace Test and reboot server

  • - -
  • I ran the orcid-authority-to-item.py script on CGSpace and mapped 2,864 ORCID identifiers from Solr to item metadata

    - +
    $ ./orcid-authority-to-item.py -db dspace -u dspace -p 'fuuu' -s http://localhost:8081/solr -d
  • I ran the DSpace cleanup script on CGSpace and it threw an error (as always):

    Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
      Detail: Key (bitstream_id)=(150659) is still referenced from table "bundle".

  • The solution is, as always:

    $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (150659);'
     UPDATE 1
  • Apply the proposed PostgreSQL indexes from DS-3636 (pull request #1791) on CGSpace (linode18)

    2018-03-07

    2018-03-08

    dspace=# select distinct text_lang from metadatavalue where resource_type_id=2;
 text_lang 
-----------
 
 ethnob
 en
 spa
 EN
 En
 en_
 en_US
 E.
 
 EN_US
 en_U
 eng
 fr
 es_ES
 es
     (16 rows)
     
     dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and text_lang in ('en','EN','En','en_','EN_US','en_U','eng');
     UPDATE 122227
     dspacetest=# select distinct text_lang from metadatavalue where resource_type_id=2;
 text_lang
-----------
 
 ethnob
 en_US
 spa
 E.
 
 fr
 es_ES
 es
     (9 rows)
  • On second inspection it looks like dc.description.provenance fields use the text_lang “en” so that’s probably why there are over 100,000 fields changed…
  • If I skip that, there are about 2,000, which seems like a more reasonable number of fields that users have edited manually or fucked up during CSV import, etc:

    dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and text_lang in ('EN','En','en_','EN_US','en_U','eng');
     UPDATE 2309
  • I will apply this on CGSpace right now
  • In other news, I was playing with adding ORCID identifiers to a dump of CIAT’s community via CSV in OpenRefine
  • Using a series of filters, flags, and GREL expressions to isolate items for a certain author, I figured out how to add ORCID identifiers to the cg.creator.id field
  • For example, a GREL expression in a custom text facet to get all items with dc.contributor.author[en_US] of a certain author with several name variations (this is how you use a logical OR in OpenRefine):

    or(value.contains('Ceballos, Hern'), value.contains('Hernández Ceballos'))
  • Then you can flag or star matching items and then use a conditional to either set the value directly or add it to an existing value:

    if(isBlank(value), "Hernan Ceballos: 0000-0002-8744-7918", value + "||Hernan Ceballos: 0000-0002-8744-7918")
  • One thing that bothers me is that this won’t honor author order
  • It might be better to do batches of these in PostgreSQL with a script that takes the place column of an author into account when setting the cg.creator.id
  • I wrote a Python script to read the author names and ORCID identifiers from CSV and create matching cg.creator.id fields: add-orcid-identifiers-csv.py (a rough sketch of the idea follows the CSV example below)
  • The CSV should have two columns: author name and ORCID identifier:

    dc.contributor.author,cg.creator.id
     "Orth, Alan",Alan S. Orth: 0000-0002-1735-7458
     "Orth, A.",Alan S. Orth: 0000-0002-1735-7458
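  • The idea is simple enough to sketch: map every name variant to its “Name: ORCID iD” string, then add a cg.creator.id value with the same place as each matching dc.contributor.author row so that author order is preserved (this is only a sketch, not the real add-orcid-identifiers-csv.py; the field IDs, sequence name, filename, and credentials are assumptions based on the queries above):

    #!/usr/bin/env python3
    # Sketch of matching author name variants from CSV and adding cg.creator.id metadata.
    # Not the real add-orcid-identifiers-csv.py; the field IDs (3, 240), the metadatavalue_seq
    # sequence name, the filename, and the credentials are assumptions.
    import csv

    import psycopg2

    # Map each name variant to its "Name: ORCID iD" string, e.g. "Orth, A." -> "Alan S. Orth: 0000-0002-1735-7458"
    creators = {}
    with open("orcid-ids.csv") as f:
        for row in csv.DictReader(f):
            creators[row["dc.contributor.author"]] = row["cg.creator.id"]

    conn = psycopg2.connect("dbname=dspacetest user=dspace password=fuuu host=localhost")
    cursor = conn.cursor()

    for author, creator in creators.items():
        # Find items with this exact author name variant, keeping the author's place (order)
        cursor.execute(
            "SELECT resource_id, place FROM metadatavalue "
            "WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value=%s",
            (author,),
        )
        for resource_id, place in cursor.fetchall():
            # Add a cg.creator.id (metadata_field_id=240) value in the same place as the author
            cursor.execute(
                "INSERT INTO metadatavalue (metadata_value_id, resource_id, metadata_field_id, "
                "text_value, place, confidence, resource_type_id) "
                "VALUES (nextval('metadatavalue_seq'), %s, 240, %s, %s, -1, 2)",
                (resource_id, creator, place),
            )

    conn.commit()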
  • I didn’t integrate the ORCID API lookup for author names in this script for now because I was only interested in “tagging” old items for a few given authors
  • I added ORCID identifiers for 187 items by CIAT’s Hernan Ceballos, because that is what Elizabeth was trying to do manually!
  • Also, I decided to add ORCID identifiers for all records from Peter, Abenet, and Sisay as well

    2018-03-09

    2018-03-11

    2018-03-11 11:38:15,592 WARN  org.dspace.app.webui.servlet.InternalErrorServlet @ :session_id=91C2C0C59669B33A7683570F6010603A:internal_error:-- URL Was: https://cgspace.cgiar.or
     g/jspui/listings-and-reports
     -- Method: POST
    @@ -277,21 +238,15 @@ g/jspui/listings-and-reports
     -- step: "1"
     
     org.apache.jasper.JasperException: java.lang.NullPointerException
  • Looks like I needed to remove the Humidtropics subject from Listings and Reports because it was looking for the terms and couldn’t find them
  • I made a quick fix and it’s working now (#364)


    2018-03-12

    2018-03-13

    2018-03-14

    2018-03-15

    Listing and Reports layout

    org.apache.jasper.JasperException: java.lang.ArrayIndexOutOfBoundsException: -1

    2018-03-16

    dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=230 and text_value='';
  • Copy all CRP subjects to a CSV to do the mass updates:

    dspace=# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=230 group by text_value order by count desc) to /tmp/crps.csv with csv header;
     COPY 21
  • Once I prepare the new input forms (#362) I will need to do the batch corrections:

    $ ./fix-metadata-values.py -i Correct-21-CRPs-2018-03-16.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.crp -t correct -m 230 -n -d

  • Create a pull request to update the input forms for the new CRP subject style (#366)

    2018-03-19

    2018-03-19 09:10:54,856 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
     ...
     2018-03-19 09:10:54,862 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL query singleTable Error -
  • But I don’t even know what these errors mean, because a handful of them happen every day:

    $ grep -c 'ERROR org.dspace.storage.rdbms.DatabaseManager' dspace.log.2018-03-1*
     dspace.log.2018-03-10:13
     dspace.log.2018-03-11:15
dspace.log.2018-03-16:13
     dspace.log.2018-03-17:13
     dspace.log.2018-03-18:15
     dspace.log.2018-03-19:90
  • There wasn’t even a lot of traffic at the time (8–9 AM):

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "19/Mar/2018:0[89]:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
     92 40.77.167.197
     92 83.103.94.48
     96 40.77.167.175
    116 207.46.13.178
    122 66.249.66.153
    140 95.108.181.88
    196 213.55.99.121
    206 197.210.168.174
    207 104.196.152.243
    294 54.198.169.202

  • Well, there is a hint in Tomcat’s catalina.out:

    Mon Mar 19 09:05:28 UTC 2018 | Query:id: 92032 AND type:2
     Exception in thread "http-bio-127.0.0.1-8081-exec-280" java.lang.OutOfMemoryError: Java heap space
    -
  • - -
  • So someone was doing something heavy somehow… my guess is content and usage stats!

  • - -
  • ICT responded that they “fixed” the CGSpace connectivity issue in Nairobi without telling me the problem

  • - -
  • When I asked, Robert Okal said CGNET messed up when updating the DNS for cgspace.cgiar.org last week

  • - -
  • I told him that my request last week was for dspacetest.cgiar.org, not cgspace.cgiar.org!

  • - -
  • So they updated the wrong fucking DNS records

  • - -
  • Magdalena from CCAFS wrote to ask about one record that has a bunch of metadata missing in her Listings and Reports export

  • - -
  • It appears to be this one: https://cgspace.cgiar.org/handle/10568/83473?show=full

  • - -
  • The title is “Untitled” and there is some metadata but indeed the citation is missing

  • - -
  • I don’t know what would cause that

  • + - -

    2018-03-20

    2018-03-20 08:47:10,177 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
     ...
     2018-03-20 08:53:11,624 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
     org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.OutOfMemoryError: Java heap space
  • I have no idea why it crashed
  • I ran all system updates and rebooted it
  • Abenet told me that one of Lance Robinson’s ORCID iDs on CGSpace is incorrect
  • I will remove it from the controlled vocabulary (#367) and update any items using the old one:

    dspace=# update metadatavalue set text_value='Lance W. Robinson: 0000-0002-5224-8644' where resource_type_id=2 and metadata_field_id=240 and text_value like '%0000-0002-6344-195X%';
     UPDATE 1
  • Communicate with DSpace editors on Yammer about being more careful about spaces and character editing when doing manual metadata edits
  • Merge the changes to CRP names to the 5_x-prod branch and deploy on CGSpace (#363)
  • Run corrections for CRP names in the database:

    $ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db dspace -u dspace -p 'fuuu'
  • Run all system updates on CGSpace (linode18) and reboot the server
  • I started a full Discovery re-index on CGSpace because of the updated CRPs
  • I see this error in the DSpace log:

    2018-03-20 19:03:14,844 ERROR com.atmire.dspace.discovery.AtmireSolrService @ No choices plugin was configured for  field "dc_contributor_author".
     java.lang.IllegalArgumentException: No choices plugin was configured for  field "dc_contributor_author".
    -    at org.dspace.content.authority.ChoiceAuthorityManager.getLabel(ChoiceAuthorityManager.java:261)
    -    at org.dspace.content.authority.ChoiceAuthorityManager.getLabel(ChoiceAuthorityManager.java:249)
    -    at org.dspace.browse.SolrBrowseCreateDAO.additionalIndex(SolrBrowseCreateDAO.java:215)
    -    at com.atmire.dspace.discovery.AtmireSolrService.buildDocument(AtmireSolrService.java:662)
    -    at com.atmire.dspace.discovery.AtmireSolrService.indexContent(AtmireSolrService.java:807)
    -    at com.atmire.dspace.discovery.AtmireSolrService.updateIndex(AtmireSolrService.java:876)
    -    at org.dspace.discovery.SolrServiceImpl.createIndex(SolrServiceImpl.java:370)
    -    at org.dspace.discovery.IndexClient.main(IndexClient.java:117)
    -    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    -    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    -    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    -    at java.lang.reflect.Method.invoke(Method.java:498)
    -    at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
    -    at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)

  • I have to figure that one out…


    2018-03-21

    dspace=# UPDATE metadatavalue SET authority=NULL WHERE resource_type_id=2 AND metadata_field_id=3 AND authority IS NOT NULL;
    UPDATE 195463

    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv header;
    COPY 56156

    2018-03-21 15:11:08,166 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error - 
     java.sql.SQLException: Connection has already been closed.
  • I have no idea why so many connections were abandoned this afternoon:

    # grep 'Mar 21, 2018' /var/log/tomcat7/catalina.out | grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon'
     268
  • DSpace Test crashed again due to Java heap space; this is from the DSpace log:

    2018-03-21 15:18:48,149 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
     org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.OutOfMemoryError: Java heap space
  • And this is from the Tomcat Catalina log:

    Mar 21, 2018 11:20:00 AM org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor run
     SEVERE: Unexpected death of background thread ContainerBackgroundProcessor[StandardEngine[Catalina]]
     java.lang.OutOfMemoryError: Java heap space
  • But there are tons of heap space errors on DSpace Test actually:

    # grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
     319
  • I guess we need to give it more RAM because it now has CGSpace’s large Solr core
  • I will increase the memory from 3072m to 4096m
  • Update Ansible playbooks to use PostgreSQL JDBC driver 42.2.2
  • Deploy the new JDBC driver on DSpace Test
  • I’m also curious to see how long the dspace index-discovery -b takes on DSpace Test, where the DSpace installation directory is on one of Linode’s new block storage volumes:

    $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
     real    208m19.155s
     user    8m39.138s
     sys     2m45.135s
  • So that’s about three times as long as it took on CGSpace this morning
  • I should also check the raw read speed with hdparm -tT /dev/sdc
  • Looking at Peter’s author corrections there are some mistakes due to Windows-1252 encoding
  • I need to find a way to filter these easily with OpenRefine
  • For example, Peter has inadvertently introduced Unicode character 0xfffd into several fields
  • I can search for Unicode values by their hex code in OpenRefine using the following GREL expression:

    isNotNull(value.match(/.*\ufffd.*/))

  • I need to be able to add many common characters, though, so that it is useful to copy and paste into a new project to find issues (a sketch for checking a whole CSV export at once follows below)

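  • Outside of OpenRefine, the same check can be run over a whole metadata CSV export before importing it; a small sketch that flags cells containing the replacement character (U+FFFD), non-breaking space (U+00A0), or hair space (U+200A); the filename is just an example:

    #!/usr/bin/env python3
    # Sketch: flag suspicious characters (replacement char, no-break space, hair space)
    # in a metadata CSV export before importing it. The filename is just an example.
    import csv

    SUSPICIOUS = {
        "\ufffd": "REPLACEMENT CHARACTER",
        "\u00a0": "NO-BREAK SPACE",
        "\u200a": "HAIR SPACE",
    }

    with open("/tmp/authors.csv", newline="") as f:
        # start=2 because the header occupies the first line of the file
        for line_number, row in enumerate(csv.DictReader(f), start=2):
            for column, value in row.items():
                for char, name in SUSPICIOUS.items():
                    if value and char in value:
                        print(f"line {line_number}, {column}: {name} in {value!r}")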
    2018-03-22

    2018-03-24

    2018-03-25

    isNotNull(value.match(/.*[a-zA-ZáÁéèïíñØøöóúü].*/))
  • But it’s probably better to just say which characters I know for sure are not valid (like parentheses, pipe, or weird Unicode characters):

    or(
  isNotNull(value.match(/.*[(|)].*/)),
  isNotNull(value.match(/.*\uFFFD.*/)),
  isNotNull(value.match(/.*\u00A0.*/)),
  isNotNull(value.match(/.*\u200A.*/))
     )
  • And here’s one combined GREL expression to check for items marked as “delete” or “check” so I can flag them and export them to a separate CSV (though perhaps it’s time to add delete support to my fix-metadata-values.py script):

    or(
  isNotNull(value.match(/.*delete.*/i)),
  isNotNull(value.match(/.*remove.*/i)),
  isNotNull(value.match(/.*check.*/i))
     )
  • So I guess the routine in OpenRefine is:

    Test the corrections and deletions locally, then run them on CGSpace:

    $ ./fix-metadata-values.py -i /tmp/Correct-2928-Authors-2018-03-21.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
     $ ./delete-metadata-values.py -i /tmp/Delete-8-Authors-2018-03-21.csv -f dc.contributor.author -m 3 -db dspacetest -u dspace -p 'fuuu'
  • Afterwards I started a full Discovery reindexing on both CGSpace and DSpace Test
  • CGSpace took 76m28.292s
  • DSpace Test took 194m56.048s


    2018-03-26

    2018-03-27

    2018-03-28

    Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space
  • Add ISI Journal (cg.isijournal) as an option in Atmire’s Listing and Reports layout (#370) for Abenet
  • I noticed a few hundred CRPs using the old capitalized formatting so I corrected them:

    $ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db cgspace -u cgspace -p 'fuuu'
     Fixed 29 occurences of: CLIMATE CHANGE, AGRICULTURE AND FOOD SECURITY
     Fixed 7 occurences of: WATER, LAND AND ECOSYSTEMS
Fixed 11 occurences of: POLICIES, INSTITUTIONS, AND MARKETS
     Fixed 28 occurences of: GRAIN LEGUMES
     Fixed 3 occurences of: FORESTS, TREES AND AGROFORESTRY
     Fixed 5 occurences of: GENEBANKS
  • That’s weird because we just updated them last week…
  • Create a pull request to enable searching by ORCID identifier (cg.creator.id) in Discovery and Listings and Reports (#371)
  • I will test it on DSpace Test first!
  • Fix one missing XMLUI string for “Access Status” (cg.identifier.status)
  • Run all system updates on DSpace Test and reboot the machine

diff --git a/docs/2018-04/index.html b/docs/2018-04/index.html

    2018-04-01

  • Catalina logs at least show some memory errors yesterday:

    Mar 31, 2018 10:26:42 PM org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor run
     SEVERE: Unexpected death of background thread ContainerBackgroundProcessor[StandardEngine[Catalina]] 
     java.lang.OutOfMemoryError: Java heap space
     
     Exception in thread "ContainerBackgroundProcessor[StandardEngine[Catalina]]" java.lang.OutOfMemoryError: Java heap space

    2018-04-05

    $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
     real    599m32.961s
     user    9m3.947s
     sys     2m52.585s
  • So we really should not use this Linode block storage for Solr
  • Assetstore might be fine but would complicate things with configuration and deployment (ughhh)
  • Better to use Linode block storage only for backup
  • Help Peter with the GDPR compliance / reporting form for CGSpace
  • DSpace Test crashed due to memory issues again:

    # grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
     16
  • I ran all system updates on DSpace Test and rebooted it
  • Proof some records on DSpace Test for Udana from IWMI
  • He has done better with the small syntax and consistency issues, but there are larger concerns with not linking to DOIs, copying titles incorrectly, etc


    2018-04-10

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "10/Apr/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10                                                                                                   
    282 207.46.13.112
    286 54.175.208.220
    287 207.46.13.113
    298 66.249.66.153
    322 207.46.13.114
    780 104.196.152.243
   3994 178.154.200.38
   4295 70.32.83.92
   4388 95.108.181.88
   7653 45.5.186.2

  • 45.5.186.2 is of course CIAT
  • 95.108.181.88 appears to be Yandex:

    95.108.181.88 - - [09/Apr/2018:06:34:16 +0000] "GET /bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg HTTP/1.1" 200 2638 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
  • And for some reason Yandex created a lot of Tomcat sessions today:

    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2018-04-10
     4363
  • 70.32.83.92 appears to be some harvester we’ve seen before, but on a new IP
  • They are not creating new Tomcat sessions so there is no problem there
  • 178.154.200.38 also appears to be Yandex, and is also creating many Tomcat sessions:

    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=178.154.200.38' dspace.log.2018-04-10
     3982
  • I’m not sure why Yandex creates so many Tomcat sessions, as its user agent should match the Crawler Session Manager valve
  • Let’s try a manual request with and without their user agent (a scripted version of the same check follows the output below):

    $ http --print Hh https://cgspace.cgiar.org/bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg 'User-Agent:Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)'
     GET /bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg HTTP/1.1
     Accept: */*
X-Cocoon-Version: 2.2.0
     X-Content-Type-Options: nosniff
     X-Frame-Options: SAMEORIGIN
     X-XSS-Protection: 1; mode=block
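  • The same test is easy to repeat from Python if I want to check it again later (the “normal” user agent string below is just an arbitrary browser UA):

    #!/usr/bin/env python3
    # Sketch: repeat the manual check above with requests, with and without the Yandex
    # user agent, and print whether the server tried to assign a new session.
    import requests

    URL = "https://cgspace.cgiar.org/bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg"
    AGENTS = {
        "yandex": "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)",
        "normal": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0 Safari/537.36",
    }

    for name, agent in AGENTS.items():
        # No cookies are sent, so any JSESSIONID in the response was assigned by the server
        response = requests.get(URL, headers={"User-Agent": agent})
        print(name, response.status_code, "JSESSIONID:", response.cookies.get("JSESSIONID"))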
  • So it definitely looks like Yandex requests are getting assigned a session from the Crawler Session Manager valve

  • And if I look at the DSpace log I see its IP sharing a session with other crawlers like Google (66.249.66.153)
  • Indeed the number of Tomcat sessions appears to be normal:

    Tomcat sessions week
    Tomcat sessions week

    # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Mar/2018"
     2266594
     
     real    0m13.658s
     user    0m16.533s
     sys     0m1.087s
  • In other news, the database cleanup script has an issue again:

    $ dspace cleanup -v
     ...
     Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
  Detail: Key (bitstream_id)=(151626) is still referenced from table "bundle".

  • The solution is, as always:

    $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (151626);'
     UPDATE 1
  • Looking at abandoned connections in Tomcat:

    # zcat /var/log/tomcat7/catalina.out.[1-9].gz | grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon'
     2115
  • Apparently from these stacktraces we should be able to see which code is not closing connections properly
  • Here’s a pretty good overview of days where we had database issues recently:

    # zcat /var/log/tomcat7/catalina.out.[1-9].gz | grep 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' | awk '{print $1,$2, $3}' | sort | uniq -c | sort -n
      1 Feb 18, 2018
      1 Feb 19, 2018
      1 Feb 20, 2018
      1 Feb 24, 2018
      2 Feb 13, 2018
      3 Feb 17, 2018
      5 Feb 16, 2018
      5 Feb 23, 2018
      5 Feb 27, 2018
      6 Feb 25, 2018
     40 Feb 14, 2018
     63 Feb 28, 2018
    154 Mar 19, 2018
    202 Feb 21, 2018
    264 Feb 26, 2018
    268 Mar 21, 2018
    524 Feb 22, 2018
    570 Feb 15, 2018

  • In Tomcat 8.5 the removeAbandoned property has been split into two: removeAbandonedOnBorrow and removeAbandonedOnMaintenance
  • See: https://tomcat.apache.org/tomcat-8.5-doc/jndi-datasource-examples-howto.html#Database_Connection_Pool_(DBCP_2)_Configurations
  • I assume we want removeAbandonedOnBorrow, and will make updates to the Tomcat 8 templates in Ansible
  • After reading more documentation I see that Tomcat 8.5’s default DBCP seems to now be Commons DBCP2 instead of Tomcat DBCP
  • It can be overridden in Tomcat’s server.xml by setting factory="org.apache.tomcat.jdbc.pool.DataSourceFactory" in the <Resource>
  • I think we should use this default, so we’ll need to remove some other settings that are specific to Tomcat’s DBCP like jdbcInterceptors and abandonWhenPercentageFull
  • Merge the changes adding ORCID identifier to advanced search and Atmire Listings and Reports (#371)
  • Fix one more issue of missing XMLUI strings (for CRP subject when clicking “view more” in the Discovery sidebar)
  • I told Udana to fix the citation and abstract of the one item, and to correct the dc.language.iso for the five Spanish items in his Book Chapters collection
  • Then we can import the records to CGSpace

    2018-04-11

    # grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
     168
  • I ran all system updates and rebooted the server


    2018-04-12

    2018-04-13


    2018-04-15

    2018-04-14 18:55:25,841 ERROR org.dspace.authority.AuthoritySolrServiceImpl @ Authority solr is not correctly configured, check "solr.authority.server" property in the dspace.cfg
     java.lang.NullPointerException
  • I assume we need to remove authority from the consumers in dspace/config/dspace.cfg:

    event.dispatcher.default.consumers = authority, versioning, discovery, eperson, harvester, statistics,batchedit, versioningmqm
    -
  • - -
  • I see the same error on DSpace Test so this is definitely a problem

  • - -
  • After disabling the authority consumer I no longer see the error

  • - -
  • I merged a pull request to the 5_x-prod branch to clean that up (#372)

  • - -
  • File a ticket on DSpace’s Jira for the target="_blank" security and performance issue (DS-3891)

  • - -
  • I re-deployed DSpace Test (linode19) and was surprised by how long it took the ant update to complete:

    - +
    BUILD SUCCESSFUL
     Total time: 4 minutes 12 seconds
  • The Linode block storage is much slower than the instance storage
  • I ran all system updates and rebooted DSpace Test (linode19)

    2018-04-16


    2018-04-18

    webui.itemlist.sort-option.1 = title:dc.title:title
     webui.itemlist.sort-option.2 = dateissued:dc.date.issued:date
     webui.itemlist.sort-option.3 = dateaccessioned:dc.date.accessioned:date
     webui.itemlist.sort-option.4 = type:dc.type:text

  • They want items by issue date, so we need to use sort option 2
  • According to the DSpace Manual there are only the following parameters to OpenSearch: format, scope, rpp, start, and sort_by
  • The OpenSearch query parameter expects a Discovery search filter that is defined in dspace/config/spring/api/discovery.xml
  • So for IWMI they should be able to use something like this: https://cgspace.cgiar.org/open-search/discover?query=dateIssued:2018&scope=10568/16814&sort_by=2&order=DESC&format=rss
  • There are also rpp (results per page) and start parameters, but in my testing now on DSpace 5.5 they behave very strangely
  • For example, set rpp=1 and then check the results for start values of 0, 1, and 2 and they are all the same (see the sketch below)!
  • If I have time I will check if this behavior persists on DSpace 6.x on the official DSpace demo and file a bug
  • Also, the DSpace Manual as of 5.x has very poor documentation for OpenSearch
  • They don’t tell you to use Discovery search filters in the query (with format query=dateIssued:2018)
  • They don’t tell you that the sort options are actually defined in dspace.cfg (ie, you need to use 2 instead of dc.date.issued_dt)
  • They are missing the order parameter (ASC vs DESC)

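  • To make the rpp/start strangeness easy to reproduce, here is a small sketch that requests one result per page for a few start offsets and prints the item titles of each response (it assumes the RSS 2.0 output of format=rss):

    #!/usr/bin/env python3
    # Sketch: probe DSpace OpenSearch paging by fetching rpp=1 at several start offsets
    # and printing the item titles on each page. Assumes RSS 2.0 output (format=rss).
    import xml.etree.ElementTree as ET

    import requests

    BASE = "https://cgspace.cgiar.org/open-search/discover"

    for start in (0, 1, 2):
        response = requests.get(
            BASE,
            params={
                "query": "dateIssued:2018",
                "scope": "10568/16814",
                "sort_by": 2,       # sort option 2 = dateissued, per dspace.cfg
                "order": "DESC",
                "format": "rss",
                "rpp": 1,
                "start": start,
            },
        )
        channel = ET.fromstring(response.content).find("channel")
        titles = [item.findtext("title") for item in channel.findall("item")]
        # If paging worked, each start offset would show a different title
        print(f"start={start}: {titles}")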
  • I notice that DSpace Test has crashed again, due to memory:

    # grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
     178
  • I will increase the JVM heap size from 5120M to 6144M, though we don’t have much room left to grow as DSpace Test (linode19) is using a smaller instance size than CGSpace
  • Gabriela from CIP asked if I could send her a list of all CIP authors so she can do some replacements on the name formats
  • I got a list of all the CIP collections manually and used the same query that I used in August, 2017:

    dspace#= \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/89347', '10568/88229', '10568/53086', '10568/53085', '10568/69069', '10568/53087', '10568/53088', '10568/53089', '10568/53090', '10568/53091', '10568/53092', '10568/70150', '10568/53093', '10568/64874', '10568/53094'))) group by text_value order by count desc) to /tmp/cip-authors.csv with csv;
    2018-04-19

    $ ant update update_geolite clean_backups
    -
    - -
  • I also re-deployed CGSpace (linode18) to make the ORCID search, authority cleanup, CCAFS project tag PII-LAM_CSAGender live

  • - -
  • When re-deploying I also updated the GeoLite databases so I hope the country stats become more accurate…

  • - -
  • After re-deployment I ran all system updates on the server and rebooted it

  • - -
  • After the reboot I forced a reïndexing of the Discovery to populate the new ORCID index:

    - +
    $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
     real    73m42.635s
     user    8m15.885s
     sys     2m2.687s
    -
  • - -
  • This time is with about 70,000 items in the repository

  • + - -

    2018-04-20

    - +

    2018-04-20

    org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-715] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:18; idle:0; lastwait:5000].
  • And there have been shit tons of these errors in the DSpace log (starting only 20 minutes ago, luckily):

    # grep -c 'org.apache.tomcat.jdbc.pool.PoolExhaustedException' /home/cgspace.cgiar.org/log/dspace.log.2018-04-20
     32147
  • I can’t even log into PostgreSQL as the postgres user, WTF?

    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c 
     ^C
  • Here are the most active IPs today:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "20/Apr/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    917 207.46.13.182
    935 213.55.99.121
    970 40.77.167.134
    978 207.46.13.80
   1422 66.249.64.155
   1577 50.116.102.77
   2456 95.108.181.88
   3216 104.196.152.243
   4325 70.32.83.92
  10718 45.5.184.2

  • It doesn’t even seem like there is a lot of traffic compared to the previous days:

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "20/Apr/2018" | wc -l
     74931
     # zcat --force /var/log/nginx/*.log.1 /var/log/nginx/*.log.2.gz| grep -E "19/Apr/2018" | wc -l
     91073
     # zcat --force /var/log/nginx/*.log.2.gz /var/log/nginx/*.log.3.gz| grep -E "18/Apr/2018" | wc -l
     93459
  • I tried to restart Tomcat but systemctl hangs
  • I tried to reboot the server from the command line but after a few minutes it didn’t come back up
  • Looking at the Linode console I see that it is stuck trying to shut down
  • Even “Reboot” via Linode console doesn’t work!
  • After shutting it down a few times via the Linode console it finally rebooted
  • Everything is back, but I have no idea what caused this; I suspect something with the hosting provider
  • Also super weird, the last entry in the DSpace log file is from 2018-04-20 16:35:09, and then immediately it goes to 2018-04-20 19:15:04 (three hours later!):

    2018-04-20 16:35:09,144 ERROR org.dspace.app.util.AbstractDSpaceWebapp @ Failed to record shutdown in Webapp table.
     org.apache.tomcat.jdbc.pool.PoolExhaustedException: [localhost-startStop-2] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:18; idle
     :0; lastwait:5000].
        at org.apache.tomcat.jdbc.pool.ConnectionPool.borrowConnection(ConnectionPool.java:685)
        at org.apache.tomcat.jdbc.pool.ConnectionPool.getConnection(ConnectionPool.java:187)
        at org.apache.tomcat.jdbc.pool.DataSourceProxy.getConnection(DataSourceProxy.java:128)
        at org.dspace.storage.rdbms.DatabaseManager.getConnection(DatabaseManager.java:632)
        at org.dspace.core.Context.init(Context.java:121)
        at org.dspace.core.Context.<init>(Context.java:95)
        at org.dspace.app.util.AbstractDSpaceWebapp.deregister(AbstractDSpaceWebapp.java:97)
        at org.dspace.app.util.DSpaceContextListener.contextDestroyed(DSpaceContextListener.java:146)
        at org.apache.catalina.core.StandardContext.listenerStop(StandardContext.java:5115)
        at org.apache.catalina.core.StandardContext.stopInternal(StandardContext.java:5779)
        at org.apache.catalina.util.LifecycleBase.stop(LifecycleBase.java:224)
        at org.apache.catalina.core.ContainerBase$StopChild.call(ContainerBase.java:1588)
        at org.apache.catalina.core.ContainerBase$StopChild.call(ContainerBase.java:1577)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
     2018-04-20 19:15:04,006 INFO  org.dspace.core.ConfigurationManager @ Loading from classloader: file:/home/cgspace.cgiar.org/config/dspace.cfg
    -
  • - -
  • Very suspect!

  • + - -

    2018-04-24

    - +

    2018-04-24

    - -

    2018-04-25

    - +

    2018-04-25

    $ psql dspacetest -c 'alter user dspacetest superuser;'
     $ pg_restore -O -U dspacetest -d dspacetest -W -h localhost /tmp/dspace_2018-04-18.backup
    -
    - -
  • There’s another issue with Tomcat in Ubuntu 18.04:

    - -
    25-Apr-2018 13:26:21.493 SEVERE [http-nio-127.0.0.1-8443-exec-1] org.apache.coyote.AbstractProtocol$ConnectionHandler.process Error reading request, ignored
    -java.lang.NoSuchMethodError: java.nio.ByteBuffer.position(I)Ljava/nio/ByteBuffer;
    -    at org.apache.coyote.http11.Http11InputBuffer.init(Http11InputBuffer.java:688)
    -    at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:672)
    -    at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66)
    -    at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:790)
    -    at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1459)
    -    at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
    -    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    -    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    -    at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    -    at java.lang.Thread.run(Thread.java:748)
    -
  • - -
  • There’s a Debian bug about this from a few weeks ago

  • - -
  • Apparently Tomcat was compiled with Java 9, so it doesn’t work with Java 8

  • + - -

    2018-04-29

    - +
    25-Apr-2018 13:26:21.493 SEVERE [http-nio-127.0.0.1-8443-exec-1] org.apache.coyote.AbstractProtocol$ConnectionHandler.process Error reading request, ignored
    + java.lang.NoSuchMethodError: java.nio.ByteBuffer.position(I)Ljava/nio/ByteBuffer;
    +        at org.apache.coyote.http11.Http11InputBuffer.init(Http11InputBuffer.java:688)
    +        at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:672)
    +        at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66)
    +        at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:790)
    +        at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1459)
    +        at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
    +        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    +        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    +        at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    +        at java.lang.Thread.run(Thread.java:748)
    +
    +

    2018-04-29

    - -

    2018-04-30

    - +

    2018-04-30

    - -

    2018-05-02

    - +

    2018-05-02

    + + +

    2018-05-03

    - -

    2018-05-06

    - +
    $ dspace metadata-export -a -f /tmp/iita.csv -i 10568/68616
    +
    +

    2018-05-06

    $ for line in $(< /tmp/links.txt); do echo $line; http --print h $line; done
    -
    - -
  • Most of the links are good, though one is a duplicate and one even seems to be incorrect on the publisher’s site, so…

  • - -
  • Also, there are some duplicates:

    - + +
  • +
  • Messed up abstracts:
  • - -
  • Fixed some issues in regions, countries, sponsors, ISSN, and cleaned whitespace errors from citation, abstract, author, and titles

  • - -
  • Fixed all issues with CRPs

  • - -
  • A few more interesting Unicode characters to look for in text fields like author, abstracts, and citations might be: ’ (0x2019), · (0x00b7), and € (0x20ac)

  • - -
  • A custom text facet in OpenRefine with this GREL expression could be good for finding invalid characters or encoding errors in authors, abstracts, etc:

    - + +
  • +
  • Fixed some issues in regions, countries, sponsors, ISSN, and cleaned whitespace errors from citation, abstract, author, and titles
  • +
  • Fixed all issues with CRPs
  • +
  • A few more interesting Unicode characters to look for in text fields like author, abstracts, and citations might be: ’ (0x2019), · (0x00b7), and € (0x20ac)
  • +
  • A custom text facet in OpenRefine with this GREL expression could be good for finding invalid characters or encoding errors in authors, abstracts, etc:
  • +
    or(
    -isNotNull(value.match(/.*[(|)].*/)),
    -isNotNull(value.match(/.*\uFFFD.*/)),
    -isNotNull(value.match(/.*\u00A0.*/)),
    -isNotNull(value.match(/.*\u200A.*/)),
    -isNotNull(value.match(/.*\u2019.*/)),
    -isNotNull(value.match(/.*\u00b7.*/)),
    -isNotNull(value.match(/.*\u20ac.*/))
    +  isNotNull(value.match(/.*[(|)].*/)),
    +  isNotNull(value.match(/.*\uFFFD.*/)),
    +  isNotNull(value.match(/.*\u00A0.*/)),
    +  isNotNull(value.match(/.*\u200A.*/)),
    +  isNotNull(value.match(/.*\u2019.*/)),
    +  isNotNull(value.match(/.*\u00b7.*/)),
    +  isNotNull(value.match(/.*\u20ac.*/))
     )
    -
    - -
  • I found some more IITA records that Sisay imported on 2018-03-23 that have invalid CRP names, so now I kinda want to check those ones!

  • - -
  • Combine the ORCID identifiers Abenet sent with our existing list and resolve their names using the resolve-orcids.py script:

    - +
    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/ilri-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2018-05-06-combined.txt
     $ ./resolve-orcids.py -i /tmp/2018-05-06-combined.txt -o /tmp/2018-05-06-combined-names.txt -d
     # sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
     $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
    -
  • - -
  • I made a pull request (#373) for this that I’ll merge some time next week (I’m expecting Atmire to get back to us about DSpace 5.8 soon)

  • - -
  • After testing quickly I just decided to merge it, and I noticed that I don’t even need to restart Tomcat for the changes to get loaded

  • + - -

    2018-05-07

    - +

    2018-05-07

    - -

    2018-05-09

    - +

    2018-05-09

    $ xmllint --xpath '//value-pairs[@value-pairs-name="crpsubject"]/pair/stored-value/node()' dspace/config/input-forms.xml        
     Agriculture for Nutrition and HealthBig DataClimate Change, Agriculture and Food SecurityExcellence in BreedingFishForests, Trees and AgroforestryGenebanksGrain Legumes and Dryland CerealsLivestockMaizePolicies, Institutions and MarketsRiceRoots, Tubers and BananasWater, Land and EcosystemsWheatAquatic Agricultural SystemsDryland CerealsDryland SystemsGrain LegumesIntegrated Systems for the Humid TropicsLivestock and Fish
    -
    - -
  • Maybe xmlstarlet is better:

    - +
    $ xmlstarlet sel -t -v '//value-pairs[@value-pairs-name="crpsubject"]/pair/stored-value/text()' dspace/config/input-forms.xml
     Agriculture for Nutrition and Health
     Big Data
    @@ -285,209 +261,163 @@ Dryland Systems
     Grain Legumes
     Integrated Systems for the Humid Tropics
     Livestock and Fish
    -
  • - -
  • Discuss Colombian BNARS harvesting the CIAT data from CGSpace

  • - -
  • They are using a system called Primo and the only options for data harvesting in that system are via FTP and OAI

  • - -
  • I told them to get all CIAT records via OAI (see the example request after this list)

  • - -
  • Just a note to myself, I figured out how to get reconcile-csv to run from source rather than running the old pre-compiled JAR file:

    - -
    $ lein run /tmp/crps.csv name id
    -
  • - -
  • I tried to reconcile against a CSV of our countries but reconcile-csv crashes

  • + - -
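
  • As a rough sketch, harvesting one community over OAI-PMH is a request like the one below; as far as I recall DSpace names community sets like com_10568_xxxxx (the handle with the slash replaced by an underscore), and the xxxxx here is just a placeholder for the CIAT community:

    $ curl -s 'https://cgspace.cgiar.org/oai/request?verb=ListRecords&metadataPrefix=oai_dc&set=com_10568_xxxxx'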

    2018-05-13

    - +
    $ lein run /tmp/crps.csv name id
    +
    +

    2018-05-13

    - -

    2018-05-14

    - + + +

    2018-05-14

    - -

    2018-05-15

    - +

    2018-05-15

    import urllib2
     import re
     
     pattern = re.compile('.*10.1016.*')
     if pattern.match(value):
    -get = urllib2.urlopen(value)
    -return get.getcode()
    +  get = urllib2.urlopen(value)
    +  return get.getcode()
     
     return "blank"
    -
    - -
  • I used a regex to limit it to just some of the DOIs in this case because there were thousands of URLs

  • - -
  • Here the response code would be 200, 404, etc, or “blank” if there is no URL for that item

  • - -
  • You could use this in a facet or in a new column

  • - -
  • More information and good examples here: https://programminghistorian.org/lessons/fetch-and-parse-data-with-openrefine

  • - -
  • Finish looking at the 2,640 CIFOR records on DSpace Test (10568/92904), cleaning up authors and adding collection mappings

  • - -
  • They can now be moved to CGSpace as far as I’m concerned, but I don’t know if Sisay will do it or me

  • - -
  • I was checking the CIFOR data for duplicates using Atmire’s Metadata Quality Module (and found some duplicates actually), but then DSpace died…

  • - -
  • I didn’t see anything in the Tomcat, DSpace, or Solr logs, but I saw this in dmesg -T:

    - +
    [Tue May 15 12:10:01 2018] Out of memory: Kill process 3763 (java) score 706 or sacrifice child
     [Tue May 15 12:10:01 2018] Killed process 3763 (java) total-vm:14667688kB, anon-rss:5705268kB, file-rss:0kB, shmem-rss:0kB
     [Tue May 15 12:10:01 2018] oom_reaper: reaped process 3763 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    -
  • - -
  • So the Linux kernel killed Java…

  • - -
  • Maria from Bioversity mailed to say she got an error while submitting an item on CGSpace:

    - +
    Unable to load Submission Information, since WorkspaceID (ID:S96060) is not a valid in-process submission
    -
  • - -
  • Looking in the DSpace log I see something related:

    - +
    2018-05-15 12:35:30,858 INFO  org.dspace.submit.step.CompleteStep @ m.garruccio@cgiar.org:session_id=8AC4499945F38B45EF7A1226E3042DAE:submission_complete:Completed submission with id=96060
    -
  • - -
  • So I’m not sure…

  • - -
  • I finally figured out how to get OpenRefine to reconcile values from Solr via conciliator:

  • - -
  • The trick was to use a more appropriate Solr fieldType text_en instead of text_general so that more terms match, for example uppercase and lower case:

    - +
    $ ./bin/solr start
     $ ./bin/solr create_core -c countries
     $ curl -X POST -H 'Content-type:application/json' --data-binary '{"add-field": {"name":"country", "type":"text_en", "multiValued":false, "stored":true}}' http://localhost:8983/solr/countries/schema
     $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
    -
  • - -
  • It still doesn’t catch simple mistakes like “ALBANI” or “AL BANIA” for “ALBANIA”, and it doesn’t return scores, so I have to select matches manually:

  • + - -

    OpenRefine reconciling countries from local Solr

    - +

    OpenRefine reconciling countries from local Solr

    <defaultSearchField>search_text</defaultSearchField>
     ...
     <copyField source="*" dest="search_text"/>
    -
    - -
  • Actually, I wonder how much of their schema I could just copy…

  • - -
  • Apparently the default search field is the df parameter and you could technically just add it to the query string, so no need to bother with that in the schema now (see the example query after this list)

  • - -
  • I copied over the DSpace search_text field type from the DSpace Solr config (had to remove some properties so Solr would start) but it doesn’t seem to be any better at matching than the text_en type

  • - -
  • I think I need to focus on trying to return scores with conciliator

  • + - -
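
  • For example, a quick query against the local countries core that overrides the default field at request time (the q value is arbitrary):

    $ curl 'http://localhost:8983/solr/countries/select?q=albania&df=country&wt=json'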

    2018-05-16

    - +

    2018-05-16

    +
  • Silvia asked if I could sort the records in her Listings and Report output and it turns out that the options are misconfigured in dspace/config/modules/atmire-listings-and-reports.cfg
  • I created and merged a pull request to fix the sorting issue in Listings and Reports (#374)
  • - -
  • Regarding the IP Address Anonymization for GDPR, I amended the Google Analytics snippet in page-structure-alterations.xsl to:

    - -
    ga('send', 'pageview', {
    -'anonymizeIp': true
    -});
    -
  • - -
  • I tested loading a certain page before and after adding this, and afterwards I saw that the parameter aip=1 was being sent with the analytics request to Google

  • - -
  • According to the analytics.js protocol parameter documentation this means that IPs are being anonymized

  • - -
  • After finding and fixing some duplicates in IITA’s IITA_April_27 test collection on DSpace Test (10568/92703) I told Sisay that he can move them to IITA’s Journal Articles collection on CGSpace

  • +
  • Regarding the IP Address Anonymization for GDPR, I amended the Google Analytics snippet in page-structure-alterations.xsl to:
  • - -

    2018-05-17

    - +
    ga('send', 'pageview', {
    +  'anonymizeIp': true
    +});
    +
    +

    2018-05-17

    - -

    2018-05-18

    - +

    2018-05-18

    - -

    2018-05-20

    - +

    2018-05-20

    - -

    2018-05-21

    - +

    2018-05-21

    - -

    2018-05-22

    - +

    2018-05-22

    - -

    2018-05-23

    - +

    2018-05-23

    - -

    2018-05-28

    - +
    dspace=# select email, netid from eperson where email not like '%cgiar.org%' and email like '%@%';
    +
    +

    2018-05-28

    - -

    2018-05-30

    - +

    2018-05-30

    [Wed May 30 00:00:39 2018] Out of memory: Kill process 6082 (java) score 697 or sacrifice child
     [Wed May 30 00:00:39 2018] Killed process 6082 (java) total-vm:14876264kB, anon-rss:5683372kB, file-rss:0kB, shmem-rss:0kB
     [Wed May 30 00:00:40 2018] oom_reaper: reaped process 6082 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    -
    - -
  • I need to check the Tomcat JVM heap size/usage, command line JVM heap size (for cron jobs), and PostgreSQL memory usage

  • - -
  • It might be possible to adjust some things, but eventually we’ll need a larger VPS instance

  • - -
  • For some reason there are no JVM stats in Munin, ugh

  • - -
  • Run all system updates on DSpace Test and reboot it

  • - -
  • I generated a list of CIFOR duplicates from the CIFOR_May_9 collection using the Atmire MQM module and then dumped the HTML source so I could process it for sending to Vika

  • - -
  • I used grep to filter all relevant handle lines from the HTML source then used sed to insert a newline before each “Item1” line (as the duplicates are grouped like Item1, Item2, Item3 for each set of duplicates):

    - +
    $ grep -E 'aspect.duplicatechecker.DuplicateResults.field.del_handle_[0-9]{1,3}_Item' ~/Desktop/https\ _dspacetest.cgiar.org_atmire_metadata-quality_duplicate-checker.html > ~/cifor-duplicates.txt
     $ sed 's/.*Item1.*/\n&/g' ~/cifor-duplicates.txt > ~/cifor-duplicates-cleaned.txt
    -
  • - -
  • I told Vika to look through the list manually and indicate which ones are indeed duplicates that we should delete, and which ones to map to CIFOR’s collection

  • - -
  • A few weeks ago Peter wanted a list of authors from the ILRI collections, so I need to find a way to get the handles of all those collections

  • - -
  • I can use the /communities/{id}/collections endpoint of the REST API but it only takes IDs (not handles) and doesn’t seem to descend into sub communities (see the example request after this list)

  • - -
  • Shit, so I need the IDs for the top-level ILRI community and all its sub communities (and their sub communities)

  • - -
  • There has got to be a better way to do this than going to each community and getting their handles and IDs manually

  • - -
  • Oh shit, I literally already wrote a script to get all collections in a community hierarchy from the REST API: rest-find-collections.py

  • - -
  • The output isn’t great, but all the handles and IDs are printed in debug mode:

    - -
    $ ./rest-find-collections.py -u https://cgspace.cgiar.org/rest -d 10568/1 2> /tmp/ilri-collections.txt
    -
  • - -
  • Then I format the list of handles and put it into this SQL query to export authors from items ONLY in those collections (too many to list here):

    - -
    dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/67236','10568/67274',...))) group by text_value order by count desc) to /tmp/ilri-authors.csv with csv;
    -
  • + - -
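
  • For reference, a minimal sketch of that REST call, where 123 is just a placeholder community ID:

    $ curl -s -H "Accept: application/json" 'https://cgspace.cgiar.org/rest/communities/123/collections' | python -m json.tool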

    2018-05-31

    - +
    $ ./rest-find-collections.py -u https://cgspace.cgiar.org/rest -d 10568/1 2> /tmp/ilri-collections.txt
    +
    +
    dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/67236','10568/67274',...))) group by text_value order by count desc) to /tmp/ilri-authors.csv with csv;
    +

    2018-05-31

    $ docker pull postgres:9.5-alpine
     $ docker run --name dspacedb -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.5-alpine
     $ createuser -h localhost -U postgres --pwprompt dspacetest
    @@ -595,8 +494,7 @@ $ pg_restore -h localhost -O -U dspacetest -d dspacetest -W -h localhost ~/Downl
     $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
     $ psql -h localhost -U dspacetest -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
     $ psql -h localhost -U postgres dspacetest
    -
    - + diff --git a/docs/2018-06/index.html b/docs/2018-06/index.html index 6161f4a90..4c391ce99 100644 --- a/docs/2018-06/index.html +++ b/docs/2018-06/index.html @@ -8,21 +8,17 @@ @@ -41,21 +36,17 @@ sys 2m7.289s - + @@ -146,49 +136,39 @@ sys 2m7.289s

    -

    2018-06-04

    - +

    2018-06-04

    +
  • I added the new CCAFS Phase II Project Tag PII-FP1_PACCA2 and merged it into the 5_x-prod branch (#379)
  • - -
  • I proofed and tested the ILRI author corrections that Peter sent back to me this week:

    - +
  • I proofed and tested the ILRI author corrections that Peter sent back to me this week:
  • +
    $ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
    -
    - -
  • I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in March, 2018

  • - -
  • Time to index ~70,000 items on CGSpace:

    - +
    $ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b                                  
     
     real    74m42.646s
     user    8m5.056s
     sys     2m7.289s
    -
  • - - -

    2018-06-06

    - +

    2018-06-06

    - -

    2018-06-07

    - +

    2018-06-07

    +
  • I uploaded fixes for all those now, but I will continue with the rest of the data later
  • - -
  • Regarding the SQL migration errors, Atmire told me I need to run some migrations manually in PostgreSQL:

    - +
  • Regarding the SQL migration errors, Atmire told me I need to run some migrations manually in PostgreSQL:
  • +
    delete from schema_version where version = '5.6.2015.12.03.2';
     update schema_version set version = '5.6.2015.12.03.2' where version = '5.5.2015.12.03.2';
     update schema_version set version = '5.8.2015.12.03.3' where version = '5.5.2015.12.03.3';
    -
    - -
  • And then I need to ignore the ignored ones:

    - -
    $ ~/dspace/bin/dspace database migrate ignored
    -
  • - -
  • Now DSpace starts up properly!

  • - -
  • Gabriela from CIP got back to me about the author names we were correcting on CGSpace

  • - -
  • I did a quick sanity check on them and then did a test import with my fix-metadata-value.py script:

    - -
    $ ./fix-metadata-values.py -i /tmp/2018-06-08-CIP-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
    -
  • - -
  • I will apply them on CGSpace tomorrow I think…

  • + - -

    2018-06-09

    - +
    $ ~/dspace/bin/dspace database migrate ignored
    +
    +
    $ ./fix-metadata-values.py -i /tmp/2018-06-08-CIP-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
    +
    +

    2018-06-09

    - -

    2018-06-10

    - +

    2018-06-10

    +
     INFO [org.dspace.servicemanager.DSpaceServiceManager] Shutdown DSpace core service manager
     Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'org.dspace.servicemanager.spring.DSpaceBeanPostProcessor#0' defined in class path resource [spring/spring-dspace-applicationContext.xml]: Unsatisfied dependency expressed through constructor argument with index 0 of type [org.dspace.servicemanager.config.DSpaceConfigurationService]: : Cannot find class [com.atmire.dspace.discovery.ItemCollectionPlugin] for bean with name 'itemCollectionPlugin' defined in file [/home/aorth/dspace/config/spring/api/discovery.xml];
    -
    - -
  • I can fix this by commenting out the ItemCollectionPlugin line of discovery.xml, but from looking at the git log I’m not actually sure if that is related to MQM or not

  • - -
  • I will have to ask Atmire

  • - -
  • I continued to look at Sisay’s IITA records from last week

    - +
    +
  • +
  • I will have to tell IITA people to redo these entirely I think…
  • + +

    2018-06-11

    + +
  • It turns out that Abenet actually did a lot of small corrections on this data so when Sisay uses Bosede's original file it doesn't have all those corrections
  • +
  • So I told Sisay to re-create the collection using Abenet's XLS from last week (Mercy1805_AY.xls)
  • I was curious to see if I could create a GREL for use with a custom text facet in Open Refine to find cells with two or more consecutive spaces
  • I always use the built-in trim and collapse transformations anyways, but this seems to work to find the offending cells: isNotNull(value.match(/.*?\s{2,}.*?/))
  • I wonder if I should start checking for “smart” quotes like ’ (hex 2019)
  • - -

    2018-06-12

    - +

    2018-06-12

    + + +
    or(
    +  value.contains('€'),
    +  value.contains('6g'),
    +  value.contains('6m'),
    +  value.contains('6d'),
    +  value.contains('6e')
    +)
    +
    +

    2018-06-13

    +
    $ ./add-orcid-identifiers-csv.py -i 2018-06-13-Robin-Buruchara.csv -db dspace -u dspace -p 'fuuu'
    -
    - -
  • The contents of 2018-06-13-Robin-Buruchara.csv were:

    - +
    dc.contributor.author,cg.creator.id
     "Buruchara, Robin",Robin Buruchara: 0000-0003-0934-1218
     "Buruchara, Robin A.",Robin Buruchara: 0000-0003-0934-1218
    -
  • - -
  • On a hunch I checked to see if CGSpace’s bitstream cleanup was working properly and of course it’s broken:

    - +
    $ dspace cleanup -v
     ...
     Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
    -Detail: Key (bitstream_id)=(152402) is still referenced from table "bundle".
    -
  • - -
  • As always, the solution is to delete that ID manually in PostgreSQL:

    - + Detail: Key (bitstream_id)=(152402) is still referenced from table "bundle". +
    $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (152402);'
     UPDATE 1
    -
  • - - -

    2018-06-14

    - +

    2018-06-14

    - -

    2018-06-24

    - +

    2018-06-24

    $ dropdb -h localhost -U postgres dspacetest
     $ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
     $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
     $ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost /tmp/cgspace_2018-06-24.backup
     $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
    -
    - -
  • The -O option to pg_restore makes the import process ignore ownership specified in the dump itself, and instead makes the schema owned by the user doing the restore

  • - -
  • I always prefer to use the postgres user locally because it’s just easier than remembering the dspacetest user’s password, but then I couldn’t figure out why the resulting schema was owned by postgres

  • - -
  • So with this you connect as the postgres superuser and then switch roles to dspacetest (also, make sure this user has superuser privileges before the restore)

  • - -
  • Last week Linode emailed me to say that our Linode 8192 instance used for DSpace Test qualified for an upgrade

  • - -
  • Apparently they announced some upgrades to most of their plans in 2018-05

  • - -
  • After the upgrade I see we have more disk space available in the instance’s dashboard, so I shut the instance down and resized it from 98GB to 160GB

  • - -
  • The resize was very quick (less than one minute) and after booting the instance back up I now have 160GB for the root filesystem!

  • - -
  • I will move the DSpace installation directory back to the root file system and delete the extra 300GB block storage, as it was actually kinda slow when we put Solr there and now we don’t actually need it anymore because running the production Solr on this instance didn’t work well with 8GB of RAM

  • - -
  • Also, the larger instance we’re using for CGSpace will go from 24GB of RAM to 32, and will also get a storage increase from 320GB to 640GB… that means we don’t need to consider using block storage right now!

  • - -
  • The smaller instances get increased storage and network speed but I doubt many are actually using much of their current allocations so we probably don’t need to bother with upgrading them

  • - -
  • Last week Abenet asked if we could add dc.language.iso to the advanced search filters

  • - -
  • There is already a search filter for this field defined in discovery.xml but we aren’t using it, so I quickly enabled and tested it, then merged it to the 5_x-prod branch (#380)

  • - -
  • Back to testing the DSpace 5.8 changes from Atmire, I had another issue with SQL migrations:

    - +
    Caused by: org.flywaydb.core.api.FlywayException: Validate failed. Found differences between applied migrations and available migrations: Detected applied migration missing on the classpath: 5.8.2015.12.03.3
    -
  • - -
  • It took me a while to figure out that this migration is for MQM, which I removed after Atmire’s original advice about the migrations, so we actually need to delete this migration instead of updating it

  • - -
  • So I need to make sure to run the following during the DSpace 5.8 upgrade:

    - +
    -- Delete existing CUA 4 migration if it exists
     delete from schema_version where version = '5.6.2015.12.03.2';
     
    @@ -458,55 +409,41 @@ update schema_version set version = '5.6.2015.12.03.2' where version = '5.5.2015
     
     -- Delete MQM migration since we're no longer using it
     delete from schema_version where version = '5.5.2015.12.03.3';
    -
  • - -
  • After that you can run the migrations manually and then DSpace should work fine:

    - +
    $ ~/dspace/bin/dspace database migrate ignored
     ...
     Done.
    -
  • - -
  • Elizabeth from CIAT contacted me to ask if I could add ORCID identifiers to all of Andy Jarvis’ items on CGSpace

  • - -
  • I used my add-orcid-identifiers-csv.py script:

    - +
    $ ./add-orcid-identifiers-csv.py -i 2018-06-24-andy-jarvis-orcid.csv -db dspacetest -u dspacetest -p 'fuuu'
    -
  • - -
  • The contents of 2018-06-24-andy-jarvis-orcid.csv were:

    - +
    dc.contributor.author,cg.creator.id
     "Jarvis, A.",Andy Jarvis: 0000-0001-6543-0798
     "Jarvis, Andy",Andy Jarvis: 0000-0001-6543-0798
     "Jarvis, Andrew",Andy Jarvis: 0000-0001-6543-0798
    -
  • - - -

    2018-06-26

    - +

    2018-06-26

    - -

    2018-06-27

    - +
    2018-06-26 16:58:12,052 WARN  org.dspace.xoai.services.impl.xoai.DSpaceRepositoryConfiguration @ { OAI 2.0 :: DSpace } Not able to retrieve the dspace.oai.url property from oai.cfg. Falling back to request address
    +
    +

    2018-06-27

    $ grep delete 2018-06-22-cifor-duplicates.txt | grep -o -E '[0-9]{5}\/[0-9]{5}' > cifor-handle-to-delete.txt
     $ wc -l cifor-handle-to-delete.txt
     62 cifor-handle-to-delete.txt
    @@ -515,56 +452,40 @@ $ wc -l 10568-92904.csv
     $ while read line; do sed -i "\#$line#d" 10568-92904.csv; done < cifor-handle-to-delete.txt
     $ wc -l 10568-92904.csv
     2399 10568-92904.csv
    -
    - -
  • This iterates over the handles for deletion and uses sed with an alternative pattern delimiter of ‘#’ (which must be escaped), because the pattern itself contains a ‘/’

  • - -
  • The mapped ones will be difficult because we need their internal IDs in order to map them, and there are 50 of them:

    - +
    $ grep map 2018-06-22-cifor-duplicates.txt | grep -o -E '[0-9]{5}\/[0-9]{5}' > cifor-handle-to-map.txt
     $ wc -l cifor-handle-to-map.txt
     50 cifor-handle-to-map.txt
    -
  • - -
  • I can either get them from the database, or programmatically export the metadata using dspace metadata-export -i 10568/xxxxx

  • - -
  • Oooh, I can export the items one by one, concatenate them together, remove the headers, and extract the id and collection columns using csvkit:

    - +
    $ while read line; do filename=${line/\//-}.csv; dspace metadata-export -i $line -f $filename; done < /tmp/cifor-handle-to-map.txt
     $ sed '/^id/d' 10568-*.csv | csvcut -c 1,2 > map-to-cifor-archive.csv
    -
  • - -
  • Then I can use Open Refine to add the “CIFOR Archive” collection to the mappings

  • - -
  • Importing the 2398 items via dspace metadata-import ends up with a Java garbage collection error, so I think I need to do it in batches of 1,000 (see the splitting sketch after this list)

  • - -
  • After deleting the 62 duplicates, mapping the 50 items from elsewhere in CGSpace, and uploading 2,398 unique items, there are a total of 2,448 items added in this batch

  • - -
  • I’ll let Abenet take one last look and then move them to CGSpace

  • + - -
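
  • One rough way to split the CSV into batches of 1,000 while keeping the header row on each chunk (the output paths here are just examples):

    $ head -n 1 10568-92904.csv > /tmp/header.csv
    $ tail -n +2 10568-92904.csv | split -l 1000 - /tmp/batch_
    $ for f in /tmp/batch_*; do cat /tmp/header.csv "$f" > "${f}.csv"; done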

    2018-06-28

    - +

    2018-06-28

    [Thu Jun 28 00:00:30 2018] Out of memory: Kill process 14501 (java) score 701 or sacrifice child
     [Thu Jun 28 00:00:30 2018] Killed process 14501 (java) total-vm:14926704kB, anon-rss:5693608kB, file-rss:0kB, shmem-rss:0kB
     [Thu Jun 28 00:00:30 2018] oom_reaper: reaped process 14501 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    -
    - -
  • Look over IITA’s IITA_Jan_9_II_Ab collection from earlier this month on DSpace Test

  • - -
  • Bosede fixed a few things (and seems to have removed many French IITA subjects like AMÉLIORATION DES PLANTES and SANTÉ DES PLANTES)

  • - -
  • I still see at least one issue with author affiliations, and I didn’t bother to check the AGROVOC subjects because it’s such a mess anyways

  • - -
  • I suggested that IITA provide an updated list of subjects to us so we can include their controlled vocabulary in CGSpace, which would also make it easier to do automated validation

  • + - - + diff --git a/docs/2018-07/index.html b/docs/2018-07/index.html index 79dd13eff..2a4d1526c 100644 --- a/docs/2018-07/index.html +++ b/docs/2018-07/index.html @@ -8,16 +8,13 @@ @@ -28,18 +25,15 @@ There is insufficient memory for the Java Runtime Environment to continue. - + @@ -120,29 +114,23 @@ There is insufficient memory for the Java Runtime Environment to continue.

    -

    2018-07-01

    - +

    2018-07-01

    +
    $ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
    +
    +
    There is insufficient memory for the Java Runtime Environment to continue.
    +
    - - - -
  • I noticed there are many items that use HTTP instead of HTTPS for their Google Books URL, and some missing HTTP entirely:

    - +
    dspace=# select count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=222 and text_value like 'http://books.google.%';
    -count
    + count
     -------
    -785
    +   785
     dspace=# select count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=222 and text_value ~ '^books\.google\..*';
    -count
    + count
     -------
    - 4
    -
  • - -
  • I think I should fix that as well as some other garbage values like “test” and “dspace.ilri.org” etc:

    - + 4 +
    dspace=# begin;
     dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://books.google', 'https://books.google') where resource_type_id=2 and metadata_field_id=222 and text_value like 'http://books.google.%';
     UPDATE 785
    @@ -210,12 +190,11 @@ UPDATE 1
     dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=222 and metadata_value_id in (2299312, 10684, 10700, 996403);
     DELETE 4
     dspace=# commit;
    -
  • - -
  • Testing DSpace 5.8 with PostgreSQL 9.6 and Tomcat 8.5.32 (instead of my usual 7.0.88) and for some reason I get autowire errors on Catalina startup with 8.5.32:

    - +
    03-Jul-2018 19:51:37.272 SEVERE [localhost-startStop-1] org.apache.catalina.core.StandardContext.listenerStart Exception sending context initialized event to listener instance of class [org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener]
    -java.lang.RuntimeException: Failure during filter init: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/aorth/dspace/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations: {@org.springframework.beans.factory.annotation.Autowired(required=true)}
    + java.lang.RuntimeException: Failure during filter init: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/aorth/dspace/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations: {@org.springframework.beans.factory.annotation.Autowired(required=true)}
     	at org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener.contextInitialized(DSpaceKernelServletContextListener.java:92)
     	at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4792)
     	at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5256)
    @@ -231,269 +210,208 @@ java.lang.RuntimeException: Failure during filter init: Failed to startup the DS
     	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
     	at java.lang.Thread.run(Thread.java:748)
     Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/aorth/dspace/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations: {@org.springframework.beans.factory.annotation.Autowired(required=true)}
    -
  • - -
  • Gotta check that out later…

  • + - -

    2018-07-04

    - +

    2018-07-04

    - -

    2018-07-06

    - +

    2018-07-06

    - -

    2018-07-08

    - +

    2018-07-08

    # s3cmd sync --delete-removed /home/backup/solr/ s3://cgspace.cgiar.org/solr/
    -
    - -
  • But I need to add this to cron!

  • - -
  • I wonder if I should convert some of the cron jobs to systemd services / timers… (see the sketch after this list for both options)

  • - -
  • I sent a note to all our users on Yammer to ask them about possible maintenance on Sunday, July 14th

  • - -
  • Abenet wants to be able to search by journal title (dc.source) in the advanced Discovery search so I opened an issue for it (#384)

  • - -
  • I regenerated the list of names for all our ORCID iDs using my resolve-orcids.py script:

    - +
    $ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml | sort | uniq > /tmp/2018-07-08-orcids.txt
     $ ./resolve-orcids.py -i /tmp/2018-07-08-orcids.txt -o /tmp/2018-07-08-names.txt -d
    -
  • - -
  • But after comparing to the existing list of names I didn’t see much change, so I just ignored it

  • + - -
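
  • For the record, the cron version is a one-liner, and a systemd timer version could look roughly like the sketch below (the unit name solr-s3-backup, the schedule, and the s3cmd path are assumptions):

    # crontab entry (the time is arbitrary)
    0 4 * * * /usr/bin/s3cmd sync --delete-removed /home/backup/solr/ s3://cgspace.cgiar.org/solr/

    # /etc/systemd/system/solr-s3-backup.service
    [Unit]
    Description=Sync Solr backups to S3

    [Service]
    Type=oneshot
    ExecStart=/usr/bin/s3cmd sync --delete-removed /home/backup/solr/ s3://cgspace.cgiar.org/solr/

    # /etc/systemd/system/solr-s3-backup.timer
    [Unit]
    Description=Daily sync of Solr backups to S3

    [Timer]
    OnCalendar=daily
    Persistent=true

    [Install]
    WantedBy=timers.target

    # then enable the timer
    $ systemctl daemon-reload
    $ systemctl enable --now solr-s3-backup.timer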

    2018-07-09

    - +

    2018-07-09

    Exception in thread "http-bio-127.0.0.1-8081-exec-557" java.lang.OutOfMemoryError: Java heap space
    -
    - -
  • I’m not sure if it’s the same error, but I see this in DSpace’s solr.log:

    - +
    2018-07-09 06:25:09,913 ERROR org.apache.solr.servlet.SolrDispatchFilter @ null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
    -
  • - -
  • I see a strange error around that time in dspace.log.2018-07-08:

    - +
    2018-07-09 06:23:43,510 ERROR com.atmire.statistics.SolrLogThread @ IOException occured when talking to server at: http://localhost:8081/solr/statistics
     org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://localhost:8081/solr/statistics
    -
  • - -
  • But not sure what caused that…

  • - -
  • I got a message from Linode tonight that CPU usage was high on CGSpace for the past few hours around 8PM GMT

  • - -
  • Looking in the nginx logs I see the top ten IP addresses active today:

    - +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "09/Jul/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -1691 40.77.167.84
    -1701 40.77.167.69
    -1718 50.116.102.77
    -1872 137.108.70.6
    -2172 157.55.39.234
    -2190 207.46.13.47
    -2848 178.154.200.38
    -4367 35.227.26.162
    -4387 70.32.83.92
    -4738 95.108.181.88
    -
  • - -
  • Of those, all except 70.32.83.92 and 50.116.102.77 are NOT re-using their Tomcat sessions, for example from the XMLUI logs:

    - + 1691 40.77.167.84 + 1701 40.77.167.69 + 1718 50.116.102.77 + 1872 137.108.70.6 + 2172 157.55.39.234 + 2190 207.46.13.47 + 2848 178.154.200.38 + 4367 35.227.26.162 + 4387 70.32.83.92 + 4738 95.108.181.88 +
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2018-07-09
     4435
    -
  • - -
  • 95.108.181.88 appears to be Yandex, so I dunno why it’s creating so many sessions, as its user agent should match Tomcat’s Crawler Session Manager Valve (see the valve sketch after this list)

  • - -
  • 70.32.83.92 is on MediaTemple but I’m not sure who it is. They are mostly hitting REST so I guess that’s fine

  • - -
  • 35.227.26.162 doesn’t declare a user agent and is on Google Cloud, so I should probably mark them as a bot in nginx

  • - -
  • 178.154.200.38 is Yandex again

  • - -
  • 207.46.13.47 is Bing

  • - -
  • 157.55.39.234 is Bing

  • - -
  • 137.108.70.6 is our old friend CORE bot

  • - -
  • 50.116.102.77 doesn’t declare a user agent and lives on HostGator, but mostly just hits the REST API so I guess that’s fine

  • - -
  • 40.77.167.84 is Bing again

  • - -
  • Interestingly, the first time that I see 35.227.26.162 was on 2018-06-08

  • - -
  • I’ve added 35.227.26.162 to the bot tagging logic in the nginx vhost

  • + - -
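
  • For reference, the valve is configured in Tomcat’s server.xml with something like the snippet below; the crawlerUserAgents regex here is only an illustration, not our actual value:

    <Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
           crawlerUserAgents=".*[bB]ot.*|.*Yandex.*|.*spider.*" />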

    2018-07-10

    - +

    2018-07-10

    - -

    2018-07-11

    - +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "10/Jul/2018:(11|12|13)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +     81 193.95.22.113
    +     82 50.116.102.77
    +    112 40.77.167.90
    +    117 196.190.95.98
    +    120 178.154.200.38
    +    215 40.77.167.96
    +    243 41.204.190.40
    +    415 95.108.181.88
    +    695 35.227.26.162
    +    697 213.139.52.250
    +
    +
    213.139.52.250 - - [10/Jul/2018:13:39:41 +0000] "GET /bitstream/handle/10568/75668/dryad.png HTTP/2.0" 200 53750 "http://localhost:4200/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36"
    +
    +

    2018-07-11

    - -

    2018-07-12

    - + + +

    2018-07-12

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "11/Jul/2018:22" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    - 48 66.249.64.91
    - 50 35.227.26.162
    - 57 157.55.39.234
    - 59 157.55.39.71
    - 62 147.99.27.190
    - 82 95.108.181.88
    - 92 40.77.167.90
    - 97 183.128.40.185
    - 97 240e:f0:44:fa53:745a:8afe:d221:1232
    -3634 208.110.72.10
    +     48 66.249.64.91
    +     50 35.227.26.162
    +     57 157.55.39.234
    +     59 157.55.39.71
    +     62 147.99.27.190
    +     82 95.108.181.88
    +     92 40.77.167.90
    +     97 183.128.40.185
    +     97 240e:f0:44:fa53:745a:8afe:d221:1232
    +   3634 208.110.72.10
     # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "12/Jul/2018:00" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    - 25 216.244.66.198
    - 38 40.77.167.185
    - 46 66.249.64.93
    - 56 157.55.39.71
    - 60 35.227.26.162
    - 65 157.55.39.234
    - 83 95.108.181.88
    - 87 66.249.64.91
    - 96 40.77.167.90
    -7075 208.110.72.10
    -
    - -
  • We have never seen 208.110.72.10 before… so that’s interesting!

  • - -
  • The user agent for these requests is: Pcore-HTTP/v0.44.0

  • - -
  • A brief Google search doesn’t turn up any information about what this bot is, but lots of users complaining about it

  • - -
  • This bot does make a lot of requests all through the day, although it seems to re-use its Tomcat session:

    - + 25 216.244.66.198 + 38 40.77.167.185 + 46 66.249.64.93 + 56 157.55.39.71 + 60 35.227.26.162 + 65 157.55.39.234 + 83 95.108.181.88 + 87 66.249.64.91 + 96 40.77.167.90 + 7075 208.110.72.10 +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "Pcore-HTTP" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -17098 208.110.72.10
    +  17098 208.110.72.10
     # grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=208.110.72.10' dspace.log.2018-07-11
     1161
     # grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=208.110.72.10' dspace.log.2018-07-12
     1885
    -
  • - -
  • I think the problem is that, despite the bot requesting robots.txt, it almost exclusively requests dynamic pages from /discover:

    - +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "Pcore-HTTP" | grep -o -E "GET /(browse|discover|search-filter)" | sort -n | uniq -c | sort -rn
    -13364 GET /discover
    -993 GET /search-filter
    -804 GET /browse
    +  13364 GET /discover
    +    993 GET /search-filter
    +    804 GET /browse
     # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "Pcore-HTTP" | grep robots
     208.110.72.10 - - [12/Jul/2018:00:22:28 +0000] "GET /robots.txt HTTP/1.1" 200 1301 "https://cgspace.cgiar.org/robots.txt" "Pcore-HTTP/v0.44.0"
    -
  • - -
  • So this bot is just like Baiduspider, and I need to add it to the nginx rate limiting (see the sketch after this list)

  • - -
  • I’ll also add it to Tomcat’s Crawler Session Manager Valve to force the re-use of a common Tomcat session for all crawlers just in case

  • - -
  • Generate a list of all affiliations in CGSpace to send to Mohamed Salem to compare with the list on MEL (sorting the list by most occurrences):

    - +
    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where resource_type_id=2 and metadata_field_id=211 group by text_value order by count desc) to /tmp/affiliations.csv with csv header
     COPY 4518
     dspace=# \q
     $ csvcut -c 1 < /tmp/affiliations.csv > /tmp/affiliations-1.csv
    -
  • - -
  • We also need to discuss standardizing our countries and comparing our ORCID iDs

  • + - -
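
  • The nginx side could be a sketch like this, using a map on the user agent so that only the bad bots get a rate-limit key (requests with an empty key are not limited; the zone name, size, and rate are just examples):

    map $http_user_agent $bot_limit_key {
        default        '';
        ~*Pcore-HTTP   $binary_remote_addr;
        ~*Baiduspider  $binary_remote_addr;
    }

    limit_req_zone $bot_limit_key zone=badbots:10m rate=1r/s;

    # and inside the relevant location block:
    limit_req zone=badbots burst=5;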

    2018-07-13

    - +

    2018-07-13

    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv header;
     COPY 4518
    -
    - - -

    2018-07-15

    - +

    2018-07-15

    $ dspace oai import -c
     OAI 2.0 manager action started
     Clearing index
    @@ -507,60 +425,47 @@ Full import
     Total: 73925 items
     Purging cached OAI responses.
     OAI 2.0 manager action ended. It took 697 seconds.
    -
    - -
  • Now I see four collections in OAI for that item!

  • - -
  • I need to ask the dspace-tech mailing list if the nightly OAI import catches the case of old items that have had metadata or mappings change

  • - -
  • ICARDA sent me a list of the ORCID iDs they have in the MEL system and it looks like almost 150 are new and unique to us!

    - +
    $ cat dspace/config/controlled-vocabularies/cg-creator-id.xml | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
     1020
     $ cat dspace/config/controlled-vocabularies/cg-creator-id.xml MEL\ ORCID.json | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
     1158
    -
  • - -
  • I combined the two lists and regenerated the names for all of our ORCID iDs using my resolve-orcids.py script:

    - +
    $ cat dspace/config/controlled-vocabularies/cg-creator-id.xml MEL\ ORCID.json | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2018-07-15-orcid-ids.txt
     $ ./resolve-orcids.py -i /tmp/2018-07-15-orcid-ids.txt -o /tmp/2018-07-15-resolved-orcids.txt -d
    -
  • - -
  • Then I added the XML formatting for controlled vocabularies, sorted the list with GNU sort in vim via % !sort and then checked the formatting with tidy:

    - -
    $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
    -
  • - -
  • I will check with the CGSpace team to see if they want me to add these to CGSpace

  • - -
  • Help Udana from WLE understand some Altmetrics concepts

  • + - -

    2018-07-18

    - +
    $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
    +
    +

    2018-07-18

    178.33.237.157 - - [09/Jul/2018:17:00:46 +0000] "GET /oai/request?verb=ListRecords&resumptionToken=oai_dc////100 HTTP/1.1" 200 58653 "-" "Apache-HttpClient/4.5.2 (Java/1.8.0_121)"
     178.33.237.157 - - [09/Jul/2018:17:01:11 +0000] "GET /oai/request?verb=ListRecords&resumptionToken=oai_dc////200 HTTP/1.1" 200 67950 "-" "Apache-HttpClient/4.5.2 (Java/1.8.0_121)"
     ...
     178.33.237.157 - - [09/Jul/2018:22:10:39 +0000] "GET /oai/request?verb=ListRecords&resumptionToken=oai_dc////73900 HTTP/1.1" 20 0 25049 "-" "Apache-HttpClient/4.5.2 (Java/1.8.0_121)"
    -
    - -
  • So if they are getting 100 records per OAI request it would take them 739 requests

  • - -
  • I wonder if I should add this user agent to the Tomcat Crawler Session Manager valve… does OAI use Tomcat sessions?

  • - -
  • Appears not:

    - +
    $ http --print Hh 'https://cgspace.cgiar.org/oai/request?verb=ListRecords&resumptionToken=oai_dc////100'
     GET /oai/request?verb=ListRecords&resumptionToken=oai_dc////100 HTTP/1.1
     Accept: */*
    @@ -581,81 +486,61 @@ Vary: Accept-Encoding
     X-Content-Type-Options: nosniff
     X-Frame-Options: SAMEORIGIN
     X-XSS-Protection: 1; mode=block
    -
  • - - -

    2018-07-19

    - +

    2018-07-19

    - -

    2018-07-22

    - +

    2018-07-22

    - -

    2018-07-23

    - +
    webui.itemlist.sort-option.3 = dateaccessioned:dc.date.accessioned:date
    +
    +

    2018-07-23

    dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=15 and text_value ~ '^[0-9]{4}$';
    -count
    + count
     -------
    -53292
    + 53292
     (1 row)
     dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=15 and text_value ~ '^[0-9]{4}-[0-9]{2}$';
    -count
    + count
     -------
    -3818
    +  3818
     (1 row)
     dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=15 and text_value ~ '^[0-9]{4}-[0-9]{2}-[0-9]{2}$';
    -count
    + count
     -------
    -17357
    -
    - -
  • So it looks like YYYY is the most numerous, followed by YYYY-MM-DD, then YYYY-MM

  • + 17357 + - -

    2018-07-26

    - +

    2018-07-26

    - -

    2018-07-27

    - +

    2018-07-27

    - - + diff --git a/docs/2018-08/index.html b/docs/2018-08/index.html index efd27605d..a836b60cd 100644 --- a/docs/2018-08/index.html +++ b/docs/2018-08/index.html @@ -8,24 +8,17 @@ @@ -37,27 +30,20 @@ I ran all system updates on DSpace Test and rebooted it - + @@ -138,101 +124,69 @@ I ran all system updates on DSpace Test and rebooted it

    -

    2018-08-01

    - +

    2018-08-01

    [Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
     [Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB
     [Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    -
    - -
  • Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight

  • - -
  • From the DSpace log I see that eventually Solr stopped responding, so I guess the java process that was OOM killed above was Tomcat’s

  • - -
  • I’m not sure why Tomcat didn’t crash with an OutOfMemoryError…

  • - -
  • Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core (see the sketch after this list)

  • - -
  • The server only has 8GB of RAM so we’ll eventually need to upgrade to a larger one because we’ll start starving the OS, PostgreSQL, and command line batch processes

  • - -
  • I ran all system updates on DSpace Test and rebooted it

  • + - - -
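
  • That change is basically just bumping -Xms/-Xmx in Tomcat’s JAVA_OPTS, roughly like this (the exact file and the other flags depend on how Tomcat is installed, so treat this as a sketch):

    # e.g. in /etc/default/tomcat7 or the Tomcat systemd unit's environment
    JAVA_OPTS="-Djava.awt.headless=true -Xms6144m -Xmx6144m -Dfile.encoding=UTF-8"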

    2018-08-02

    - + + +

    2018-08-02

    [Thu Aug  2 00:00:12 2018] Out of memory: Kill process 1407 (java) score 787 or sacrifice child
     [Thu Aug  2 00:00:12 2018] Killed process 1407 (java) total-vm:18876328kB, anon-rss:6323836kB, file-rss:0kB, shmem-rss:0kB
    -
    - -
  • I am still assuming that this is the Tomcat process that is dying, so maybe actually we need to reduce its memory instead of increasing it?

  • - -
  • The risk we run there is that we’ll start getting OutOfMemory errors from Tomcat

  • - -
  • So basically we need a new test server with more RAM very soon…

  • - -
  • Abenet asked about the workflow statistics in the Atmire CUA module again

  • - -
  • Last year Atmire told me that it’s disabled by default but you can enable it with workflow.stats.enabled = true in the CUA configuration file

  • - -
  • There was a bug with adding users so they sent a patch, but I didn’t merge it because it was very dirty and I wasn’t sure it actually fixed the problem

  • - -
  • I just tried to enable the stats again on DSpace Test now that we’re on DSpace 5.8 with updated Atmire modules, but every user I search for shows “No data available”

  • - -
  • As a test I submitted a new item and I was able to see it in the workflow statistics “data” tab, but not in the graph

  • + - -

    2018-08-15

    - +

    2018-08-15

    $ ./fix-metadata-values.py -i 2018-08-15-Correct-1083-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
     $ ./delete-metadata-values.py -i 2018-08-15-Remove-11-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
    -
    - - -

    2018-08-16

    - +

    2018-08-16

    dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc limit 1500) to /tmp/2018-08-16-top-1500-authors.csv with csv; 
    -
    - -
  • Start working on adding the ORCID metadata to a handful of CIAT authors as requested by Elizabeth earlier this month

  • - -
  • I might need to overhaul the add-orcid-identifiers-csv.py script to be a little more robust about author order and ORCID metadata that might have been altered manually by editors after submission, as this script was written without that consideration

  • - -
  • After checking a few examples I see that checking only the text_value and place when adding ORCID fields is not enough anymore

  • - -
  • It was a sane assumption when I was initially migrating ORCID records from Solr to regular metadata, but now it seems that some authors might have been added or changed after item submission

  • - -
  • Now it is better to check if there is any existing ORCID identifier for a given author for the item… (see the example query after this list)

  • - -
  • I will have to update my script to extract the ORCID identifier and search for that

  • - -
  • Re-create my local DSpace database using the latest PostgreSQL 9.6 Docker image and re-import the latest CGSpace dump:

    - +
    $ sudo docker run --name dspacedb -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
     $ createuser -h localhost -U postgres --pwprompt dspacetest
     $ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
    @@ -240,18 +194,13 @@ $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
     $ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest ~/Downloads/cgspace_2018-08-16.backup
     $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
     $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
    -
    2018-08-19

    Verchot, Louis
     Verchot, L
     Verchot, L. V.
    @@ -259,12 +208,10 @@ Verchot, L.V
     Verchot, L.V.
     Verchot, LV
     Verchot, Louis V.
    -
    - -
  • I’ll just tag them all with Louis Verchot’s ORCID identifier…

  • - -
  • In the end, I’ll run the following CSV with my add-orcid-identifiers-csv.py script:

    - +
    dc.contributor.author,cg.creator.id
     "Campbell, Bruce",Bruce M Campbell: 0000-0002-0123-4859
     "Campbell, Bruce M.",Bruce M Campbell: 0000-0002-0123-4859
    @@ -293,81 +240,66 @@ Verchot, Louis V.
     "Chirinda, Ngonidzashe",Ngonidzashe Chirinda: 0000-0002-4213-6294
     "Chirinda, Ngoni",Ngonidzashe Chirinda: 0000-0002-4213-6294
     "Ngonidzashe, Chirinda",Ngonidzashe Chirinda: 0000-0002-4213-6294
    -
  • - -
  • The invocation would be:

    - +
    $ ./add-orcid-identifiers-csv.py -i 2018-08-16-ciat-orcid.csv -db dspace -u dspace -p 'fuuu'
    -
  • - -
  • I ran the script on DSpace Test and CGSpace and tagged a total of 986 ORCID identifiers

  • - -
  • Looking at the list of author affiliations from Peter one last time

  • - -
  • I notice that I should add the Unicode character 0x00b4 (´) to my list of invalid characters to look for in Open Refine, making the latest version of the GREL expression:

    - +
    or(
    -isNotNull(value.match(/.*\uFFFD.*/)),
    -isNotNull(value.match(/.*\u00A0.*/)),
    -isNotNull(value.match(/.*\u200A.*/)),
    -isNotNull(value.match(/.*\u2019.*/)),
    -isNotNull(value.match(/.*\u00b4.*/))
    +  isNotNull(value.match(/.*\uFFFD.*/)),
    +  isNotNull(value.match(/.*\u00A0.*/)),
    +  isNotNull(value.match(/.*\u200A.*/)),
    +  isNotNull(value.match(/.*\u2019.*/)),
    +  isNotNull(value.match(/.*\u00b4.*/))
     )
    -
  • - -
  • This character all by itself is indicative of encoding issues in French, Italian, and Spanish names, for example: De´veloppement and Investigacio´n

  • - -
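  • As a quick sanity check outside of Open Refine I can also grep for the literal character in an exported CSV (a sketch, using the affiliations file above as an example):

    $ grep -c '´' /tmp/2018-08-15-Correct-1083-Affiliations.csv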
  • I will run the following on DSpace Test and CGSpace:

    - +
    $ ./fix-metadata-values.py -i /tmp/2018-08-15-Correct-1083-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
     $ ./delete-metadata-values.py -i /tmp/2018-08-15-Remove-11-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
    -
  • - -
  • Then force an update of the Discovery index on DSpace Test:

    - +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m"
     $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
     real    72m12.570s
     user    6m45.305s
     sys     2m2.461s
    -
  • - -
  • And then on CGSpace:

    - +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
     $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
     real    79m44.392s
     user    8m50.730s
     sys     2m20.248s
    -
  • - -
  • Run system updates on DSpace Test and reboot the server

  • - -
  • In unrelated news, I see some newish Russian bot making a few thousand requests per day and not re-using its XMLUI session:

    - +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep '19/Aug/2018' | grep -c 5.9.6.51
     1553
     # grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-08-19
     1724
    -
  • - -
  • I don't even know how it's possible for the bot to use MORE sessions than total requests…

  • - -
  • The user agent is:

    - -
    Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)
    -
  • - -
  • So I’m thinking we should add “crawl” to the Tomcat Crawler Session Manager valve, as we already have “bot” that catches Googlebot, Bingbot, etc.

    2018-08-20

    2018-08-21

    [INFO] Processing overlay [ id org.dspace.modules:xmlui-mirage2]

    [INFO] --- maven-war-plugin:2.4:war (default-war) @ xmlui ---

    2018-08-23

    $ dspace import -a -e s.webshet@cgiar.org -s /home/swebshet/ictupdates_uploads_August_21 -m /tmp/2018-08-23-cta-ictupdates.map

    2018-08-26

    $ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-08-26-before-dspace-58.backup dspace
     $ dspace cleanup -v
    -
    - -
  • Now I can stop Tomcat and do the install:

    - +
    $ cd dspace/target/dspace-installer
     $ ant update clean_backups update_geolite
    -
  • - -
  • After the successful Ant update I can run the database migrations:

    - +
    $ psql dspace dspace
     
     dspace=> \i /tmp/Atmire-DSpace-5.8-Schema-Migration.sql 
    @@ -461,74 +369,51 @@ DELETE 1
     dspace=> \q
     
     $ dspace database migrate ignored
    -
  • - -
  • Then I’ll run all system updates and reboot the server:

    - +
    $ sudo su -
     # apt update && apt full-upgrade
     # apt clean && apt autoclean && apt autoremove
     # reboot
    -
  • - -
  • After reboot I logged in and cleared all the XMLUI caches and everything looked to be working fine

  • - -
  • Adam from WLE had asked a few weeks ago about getting the metadata for a bunch of items related to gender from 2013 until now

  • - -
  • They want a CSV with all metadata, which the Atmire Listings and Reports module can’t do

  • - -
  • I exported a list of items from Listings and Reports with the following criteria: from year 2013 until now, have WLE subject GENDER or GENDER POVERTY AND INSTITUTIONS, and CRP Water, Land and Ecosystems

  • - -
  • Then I extracted the Handle links from the report so I could export each item’s metadata as CSV

    - -
    $ grep -o -E "[0-9]{5}/[0-9]{0,5}" listings-export.txt > /tmp/iwmi-gender-items.txt
    -
  • - -
  • Then on the DSpace server I exported the metadata for each item one by one:

    - -
    $ while read -r line; do dspace metadata-export -f "/tmp/${line/\//-}.csv" -i $line; sleep 2; done < /tmp/iwmi-gender-items.txt
    -
  • - -
  • But from here I realized that each of the fifty-nine items will have different columns in their CSVs, making it difficult to combine them

  • - -
  • I’m not sure how to proceed without writing some script to parse and join the CSVs, and I don’t think it’s worth my time

  • - -
  • I tested DSpace 5.8 in Tomcat 8.5.32 and it seems to work now, so I’m not sure why I got those errors last time I tried

  • - -
  • It could have been a configuration issue, though, as I also reconciled the server.xml with the one in our Ansible infrastructure scripts

  • - -
  • But now I can start testing and preparing to move DSpace Test to Ubuntu 18.04 + Tomcat 8.5 + OpenJDK + PostgreSQL 9.6…

  • - -
  • Actually, upon closer inspection, it seems that when you try to go to Listings and Reports under Tomcat 8.5.33 you are taken to the JSPUI login page despite having already logged in in XMLUI

  • - -
  • If I type my username and password again it does take me to Listings and Reports, though…

  • - -
  • I don’t see anything interesting in the Catalina or DSpace logs, so I might have to file a bug with Atmire

  • - -
  • For what it’s worth, the Content and Usage (CUA) module does load, though I can’t seem to get any results in the graph

  • - -
  • I just checked to see if the Listings and Reports issue with using the CGSpace citation field was fixed as planned alongside the DSpace 5.8 upgrades (#589)

  • - -
  • I was able to create a new layout containing only the citation field, so I closed the ticket

    2018-08-29

    2018-08-30

    - - + diff --git a/docs/2018-09/index.html b/docs/2018-09/index.html index cd56c5bf6..f8ce191ed 100644 --- a/docs/2018-09/index.html +++ b/docs/2018-09/index.html @@ -8,11 +8,10 @@ @@ -23,13 +22,12 @@ I’m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I - + @@ -110,15 +108,13 @@ I’m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I

    -

    2018-09-02

    -
    02-Sep-2018 11:18:52.678 SEVERE [localhost-startStop-1] org.apache.catalina.core.StandardContext.listenerStart Exception sending context initialized event to listener instance of class [org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener]
      java.lang.RuntimeException: Failure during filter init: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/dspacetest.cgiar.org/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations: {@org.springframework.beans.factory.annotation.Autowired(required=true)}
         at org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener.contextInitialized(DSpaceKernelServletContextListener.java:92)
    @@ -136,107 +132,95 @@ I’m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I
         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
         at java.lang.Thread.run(Thread.java:748)
     Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/dspacetest.cgiar.org/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations:
    -
    - - +

    2018-09-10

    version: 1
     
     requests:
    -test:
    -method: GET
    -url: https://dspacetest.cgiar.org/rest/test
    -validate:
    -  raw: "REST api is running."
    +  test:
    +    method: GET
    +    url: https://dspacetest.cgiar.org/rest/test
    +    validate:
    +      raw: "REST api is running."
     
    -login:
    -url: https://dspacetest.cgiar.org/rest/login
    -method: POST
    -data:
    -  json: {"email":"test@dspace","password":"thepass"}
    +  login:
    +    url: https://dspacetest.cgiar.org/rest/login
    +    method: POST
    +    data:
    +      json: {"email":"test@dspace","password":"thepass"}
     
    -status:
    -url: https://dspacetest.cgiar.org/rest/status
    -method: GET
    -headers:
    -  rest-dspace-token: Value(login)
    +  status:
    +    url: https://dspacetest.cgiar.org/rest/status
    +    method: GET
    +    headers:
    +      rest-dspace-token: Value(login)
     
    -logout:
    -url: https://dspacetest.cgiar.org/rest/logout
    -method: POST
    -headers:
    -  rest-dspace-token: Value(login)
    +  logout:
    +    url: https://dspacetest.cgiar.org/rest/logout
    +    method: POST
    +    headers:
    +      rest-dspace-token: Value(login)
     
     # vim: set sw=2 ts=2:
    -
    - -
  • Works pretty well, though the DSpace logout always returns an HTTP 415 error for some reason

  • - -
  • We could eventually use this to test sanity of the API for creating collections etc

  • - -
  • A user is getting an error in her workflow:

    - +
    2018-09-10 07:26:35,551 ERROR org.dspace.submit.step.CompleteStep @ Caught exception in submission step: 
     org.dspace.authorize.AuthorizeException: Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:2 by user 3819
    -
  • - -
  • Seems to be during submit step, because it’s workflow step 1…?

  • - -
  • Move some top-level CRP communities to be below the new CGIAR Research Programs and Platforms community:

    - +
    $ dspace community-filiator --set -p 10568/97114 -c 10568/51670
     $ dspace community-filiator --set -p 10568/97114 -c 10568/35409
     $ dspace community-filiator --set -p 10568/97114 -c 10568/3112
    -
  • - -
  • Valerio contacted me to point out some issues with metadata on CGSpace, which I corrected in PostgreSQL:

    - +
    update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='ISI Juornal';
     UPDATE 1
     update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='ISI journal';
    @@ -247,46 +231,37 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id=226 and
     DELETE 17
     update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='ISI';
     UPDATE 15
    -
  • - -
  • Start working on adding metadata for access and usage rights that we started earlier in 2018 (and also in 2017)

  • - -
  • The current cg.identifier.status field will become “Access rights” and dc.rights will become “Usage rights”

  • - -
  • I have some work in progress on the 5_x-rights branch

  • - -
  • Linode said that CGSpace (linode18) had a high CPU load earlier today

  • - -
  • When I looked, I see it’s the same Russian IP that I noticed last month:

    - +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "10/Sep/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      1459 157.55.39.202
      1579 95.108.181.88
      1615 157.55.39.147
      1714 66.249.64.91
      1924 50.116.102.77
      3696 157.55.39.106
      3763 157.55.39.148
      4470 70.32.83.92
      4724 35.237.175.180
     14132 5.9.6.51
  • - -
  • And this bot is still creating more Tomcat sessions than Nginx requests (WTF?):

    # grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-09-10 
     14133
    -
  • - -
  • The user agent is still the same:

    - +
    Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)
    -
  • - -
  • I added .*crawl.* to the Tomcat Crawler Session Manager Valve, so I'm not sure why the bot is creating so many sessions…

  • - -
  • I just tested that user agent on CGSpace and it does not create a new session:

    - +
    $ http --print Hh https://cgspace.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)'
     GET / HTTP/1.1
     Accept: */*
    @@ -309,288 +284,219 @@ X-Cocoon-Version: 2.2.0
     X-Content-Type-Options: nosniff
     X-Frame-Options: SAMEORIGIN
     X-XSS-Protection: 1; mode=block
    -
  • - -
  • I will have to keep an eye on it and perhaps add it to the list of “bad bots” that get rate limited

    2018-09-12

    $ sudo docker volume create --name dspacetest_data
     $ sudo docker run --name dspacedb -v dspacetest_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
    -
    - -
  • Sisay is still having problems with the controlled vocabulary for top authors

  • - -
  • I took a look at the submission template and Firefox complains that the XML file is missing a root element

  • - -
  • I guess it’s because Firefox is receiving an empty XML file

  • - -
  • I told Sisay to run the XML file through tidy

  • - -
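  • For reference, the tidy invocation I normally use on DSpace's controlled vocabulary XML files is something like this (the filename here is just an example):

    $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/dc-contributor-author.xml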
  • More testing of the access and usage rights changes

    2018-09-13

    # zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E "13/Sep/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10                                                                                                
        32 46.229.161.131
        38 104.198.9.108
        39 66.249.64.91
        56 157.55.39.224
        57 207.46.13.49
        58 40.77.167.120
        78 169.255.105.46
       702 54.214.112.202
      1840 50.116.102.77
      4469 70.32.83.92
  • And the top two addresses seem to be re-using their Tomcat sessions properly:

    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=70.32.83.92' dspace.log.2018-09-13 | sort | uniq
     7
     $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-13 | sort | uniq
     2
    -
  • - -
  • So I’m not sure what’s going on

  • - -
  • Valerio asked me if there’s a way to get the page views and downloads from CGSpace

  • - -
  • I said no, but that we might be able to piggyback on the Atmire statlet REST API

  • - -
  • For example, when you expand the “statlet” at the bottom of an item like 10568/97103 you can see the following request in the browser console:

    - -
    https://cgspace.cgiar.org/rest/statlets?handle=10568/97103
    -
  • - -
  • That JSON file has the total page views and item downloads for the item…

  • - -
  • Abenet forwarded a request by CIP that item thumbnails be included in RSS feeds

  • - -
  • I had a quick look at the DSpace 5.x manual and it doesn't seem that this is possible (you can only add metadata)

  • - -
  • Testing the new LDAP server that CGNET says will be replacing the old one, it doesn't seem that they are using the global catalog on port 3269 anymore; now only 636 is open

  • - -
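  • A quick way to check which of those ports are actually reachable and that the certificate on 636 looks sane (a sketch; ldap.example.org is a placeholder for the new LDAP server's hostname):

    $ nc -zv ldap.example.org 636
    $ nc -zv ldap.example.org 3269
    $ openssl s_client -connect ldap.example.org:636 -showcerts </dev/null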
  • I did a clean deploy of DSpace 5.8 on Ubuntu 18.04 with some stripped down Tomcat 8 configuration and actually managed to get it up and running without the autowire errors that I had previously experienced

  • - -
  • I realized that it always works on my local machine with Tomcat 8.5.x, but not when I do the deployment from Ansible in Ubuntu 18.04

  • - -
  • So there must be something in my Tomcat 8 server.xml template

  • - -
  • Now I re-deployed it with the normal server template and it’s working, WTF?

  • - -
  • Must have been something like an old DSpace 5.5 file in the spring folder… weird

  • - -
  • But yay, this means we can update DSpace Test to Ubuntu 18.04, Tomcat 8, PostgreSQL 9.6, etc…

    2018-09-14

    2018-09-16

    2018-09-17

    +
  • Update these immediately, but talk to CodeObia to create a mapping between the old and new values
  • Finalize dc.rights “Usage rights” with seven combinations of Creative Commons, plus the others
  • Need to double check the new CRP community to see why the collection counts aren't updated after we moved the communities there last week
  • Check if it's possible to have items deposited via REST use a workflow so we can perhaps tell ICARDA to use that from MEL
  • Agree that we'll publicize AReS explorer on the week before the Big Data Platform workshop
  • I want to explore creating a thin API to make the item view and download stats available from Solr so CodeObia can use them in the AReS explorer
  • Currently CodeObia is exploring using the Atmire statlets internal API, but I don't really like that…
  • There are some example queries on the DSpace Solr wiki
  • For example, this query returns 1655 rows for item 10568/10630:

    $ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false'
    -
    - -
  • The id in the Solr query is the item’s database id (get it from the REST API or something)

  • - -
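  • For reference, the database id can be looked up from the REST API using the item's handle, for example (the "id" field in the JSON response should be the number that goes into the Solr query):

    $ http --print b 'https://dspacetest.cgiar.org/rest/handle/10568/10630'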
  • Next, I adapted a query to get the downloads and it shows 889, which is similar to the number Atmire's statlet shows, though the query logic here is confusing:

    - +
    $ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false&fq=-(bundleName:[*+TO+*]-bundleName:ORIGINAL)&fq=-(statistics_type:[*+TO+*]+-statistics_type:view)'
    -
  • - -
  • According to the SolrQuerySyntax page on the Apache wiki, the [* TO *] syntax just selects a range (in this case all values for a field)

  • - -
  • So it seems to be:

    - + - -

  • What the shit, I think I'm right: the simplified logic in this query returns the same 889:

    $ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false&fq=bundleName:ORIGINAL&fq=-(statistics_type:[*+TO+*]+-statistics_type:view)'

    $ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false&fq=bundleName:ORIGINAL&fq=statistics_type:view'

    $ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false&fq=-bundleName:ORIGINAL&fq=statistics_type:view'

    2018-09-18

    $ http -b 'https://dspacetest.cgiar.org/rest/statistics/item?id=110988'
     {
    -"downloads": 2,
    -"id": 110988,
    -"views": 15
    +    "downloads": 2,
    +    "id": 110988,
    +    "views": 15
     }
    -
    - -
  • The numbers are different than those that come from Atmire’s statlets for some reason, but as I’m querying Solr directly, I have no idea where their numbers come from!

  • - -
  • Moayad from CodeObia asked if I could make the API be able to paginate over all items, for example: /statistics?limit=100&page=1

  • - -
  • Getting all the item IDs from PostgreSQL is certainly easy:

    - -
    dspace=# select item_id from item where in_archive is True and withdrawn is False and discoverable is True;
    -
  • - -
  • The rest of the Falcon tooling will be more difficult…

  • + - -

    2018-09-19

    - -

    2018-09-20

    - -

    ((maxDoc/8) + 128) * (size_defined_in_solrconfig.xml)

    ((149374568/8) + 128) * 512 = 9560037888 bytes (8.9 GB)

    2018-09-21

    - -

  • I think it would also be nice to cherry-pick the fixes for DS-3883, which is related to optimizing the XMLUI item display of items with many bitstreams

    2018-09-23

    2018-09-24

    > SELECT views.views, views.id, downloads.downloads, downloads.id FROM itemviews views
     LEFT JOIN itemdownloads downloads USING(id)
     UNION ALL
     SELECT views.views, views.id, downloads.downloads, downloads.id FROM itemdownloads downloads
     LEFT JOIN itemviews views USING(id)
     WHERE views.id IS NULL;
    -
    - -
  • This “works” but the resulting rows are kinda messy so I’d have to do extra logic in Python

  • - -
  • Maybe we can use one “items” table with defaults values and UPSERT (aka insert… on conflict … do update):

    - +
    sqlite> CREATE TABLE items(id INT PRIMARY KEY, views INT DEFAULT 0, downloads INT DEFAULT 0);
     sqlite> INSERT INTO items(id, views) VALUES(0, 52);
     sqlite> INSERT INTO items(id, downloads) VALUES(1, 171);
    @@ -598,32 +504,24 @@ sqlite> INSERT INTO items(id, downloads) VALUES(1, 176) ON CONFLICT(id) DO UP
     sqlite> INSERT INTO items(id, views) VALUES(0, 78) ON CONFLICT(id) DO UPDATE SET views=78;
     sqlite> INSERT INTO items(id, views) VALUES(0, 3) ON CONFLICT(id) DO UPDATE SET downloads=3;
     sqlite> INSERT INTO items(id, views) VALUES(0, 7) ON CONFLICT(id) DO UPDATE SET downloads=excluded.views;
    -
  • - -
  • This totally works!

  • - -
  • Note the special excluded.views form! See SQLite’s lang_UPSERT documentation

  • - -
  • Oh nice, I finally finished the Falcon API route to page through all the results using SQLite’s amazing LIMIT and OFFSET support

  • - -
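  • The underlying pagination query is nothing fancy, basically just this (a sketch, assuming the same items table as above and a page size of 100):

    sqlite> SELECT id, views, downloads FROM items ORDER BY id ASC LIMIT 100 OFFSET 0;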
  • But when I deployed it on my Ubuntu 16.04 environment I realized Ubuntu’s SQLite is old and doesn’t support UPSERT, so my indexing doesn’t work…

  • - -
  • Apparently UPSERT came in SQLite 3.24.0 (2018-06-04), and Ubuntu 16.04 has 3.11.0

  • - -
  • Ok this is hilarious, I manually downloaded the libsqlite3 3.24.0 deb from Ubuntu 18.10 “cosmic” and installed it in Ubuntu 16.04 and now the Python indexer.py works

  • - -
  • This is definitely a dirty hack, but the list of packages we use that depend on libsqlite3-0 in Ubuntu 16.04 are actually pretty few:

    - +
    # apt-cache rdepends --installed libsqlite3-0 | sort | uniq
      gnupg2
      libkrb5-26-heimdal
      libnss3
      libpython2.7-stdlib
      libpython3.5-stdlib
  • - -
  • I wonder if I could work around this by detecting the SQLite library version, for example on Ubuntu 16.04 after I replaced the library:

    # python3
     Python 3.5.2 (default, Nov 23 2017, 16:37:01) 
     [GCC 5.4.0 20160609] on linux
    @@ -631,266 +529,197 @@ Type "help", "copyright", "credits" or "licen
     >>> import sqlite3
     >>> print(sqlite3.sqlite_version)
     3.24.0
    -
  • - -
  • Or maybe I should just bite the bullet and migrate this to PostgreSQL, as it supports UPSERT since version 9.5 and also seems to have my new favorite LIMIT and OFFSET

  • - -
  • I changed the syntax of the SQLite stuff and PostgreSQL is working flawlessly with psycopg2… hmmm.

  • - -
  • For reference, creating a PostgreSQL database for testing this locally (though indexer.py will create the table):

    - +
    $ createdb -h localhost -U postgres -O dspacestatistics --encoding=UNICODE dspacestatistics
     $ createuser -h localhost -U postgres --pwprompt dspacestatistics
     $ psql -h localhost -U postgres dspacestatistics
     dspacestatistics=> CREATE TABLE IF NOT EXISTS items
     dspacestatistics-> (id INT PRIMARY KEY, views INT DEFAULT 0, downloads INT DEFAULT 0)
    -
  • - - -

    2018-09-25

    $ dspace stats-util -f
    -
    - -
  • The command comes back after a few seconds and I still see 2,000,000 documents in the statistics core with isBot:true

  • - -
  • I was just writing a message to the dspace-tech mailing list and then I decided to check the number of bot view events on DSpace Test again, and now it’s 201 instead of 2,000,000, and statistics core is only 30MB now!

  • - -
  • I will set the logBots = false property in dspace/config/modules/usage-statistics.cfg on DSpace Test and check if the number of isBot:true events goes up any more…

  • - -
  • I restarted the server with logBots = false and after it came back up I see 266 events with isBots:true (maybe they were buffered)… I will check again tomorrow

  • - -
  • After a few hours I see there are still only 266 view events with isBot:true on DSpace Test’s Solr statistics core, so I’m definitely going to deploy this on CGSpace soon

  • - -
  • Also, CGSpace currently has 60,089,394 view events with isBot:true in its Solr statistics core and it is 124GB!

  • - -
  • Amazing! After running dspace stats-util -f on CGSpace the Solr statistics core went from 124GB to 60GB, and now there are only 700 events with isBot:true so I should really disable logging of bot events!

  • - -
  • I’m super curious to see how the JVM heap usage changes…

  • - -
  • I made (and merged) a pull request to disable bot logging on the 5_x-prod branch (#387)

  • - -
  • Now I’m wondering if there are other bot requests that aren’t classified as bots because the IP lists or user agents are outdated

  • - -
  • DSpace ships a list of spider IPs, for example: config/spiders/iplists.com-google.txt

  • - -
  • I checked the list against all the IPs we’ve seen using the “Googlebot” useragent on CGSpace’s nginx access logs

  • - -
  • The first thing I learned is that shit tons of IPs in Russia, Ukraine, Ireland, Brazil, Portugal, the US, Canada, etc are pretending to be “Googlebot”…

  • - -
  • According to the Googlebot FAQ the domain name in the reverse DNS lookup should contain either googlebot.com or google.com

  • - -
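  • A quick way to spot check that is to do the reverse lookups from the command line, for example with one IP from the Googlebot range and one of the suspicious ones seen in the logs above:

    $ host 66.249.64.91
    $ host 34.218.226.147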
  • In Solr this appears to be an appropriate query that I can maybe use later (returns 81,000 documents):

    - +
    *:* AND (dns:*googlebot.com. OR dns:*google.com.) AND isBot:false
    -
  • - -
  • I translate that into a delete command using the /update handler:

    - +
    http://localhost:8081/solr/statistics/update?commit=true&stream.body=<delete><query>*:*+AND+(dns:*googlebot.com.+OR+dns:*google.com.)+AND+isBot:false</query></delete>
    -
  • - -
  • And magically all those 81,000 documents are gone!

  • - -
  • After a few hours the Solr statistics core is down to 44GB on CGSpace!

  • - -
  • I did a major refactor and logic fix in the DSpace Statistics API’s indexer.py

  • - -
  • Basically, it turns out that using facet.mincount=1 is really beneficial for me because it reduces the size of the Solr result set, reduces the amount of data we need to ingest into PostgreSQL, and the API returns HTTP 404 Not Found for items without views or downloads anyways

  • - -
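  • The kind of query I mean is something like this, faceting view events by id with facet.mincount=1 so that items with zero hits simply don't appear in the result (a sketch; the exact parameters in indexer.py may differ slightly):

    $ http 'http://localhost:8081/solr/statistics/select?q=type:2&fq=isBot:false&fq=statistics_type:view&facet=true&facet.field=id&facet.mincount=1&facet.limit=10&rows=0&wt=json&indent=true'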
  • I deployed the new version on CGSpace and now it looks pretty good!

    - +
    Indexing item views (page 28 of 753)
     ...
     Indexing item downloads (page 260 of 260)
    -
  • - -
  • And now it’s fast as hell due to the muuuuch smaller Solr statistics core

  • + - -

    2018-09-26

    Tomcat max processing time week

    $ ./fix-metadata-values.py -i /tmp/fix-access-status.csv -db dspace -u dspace -p 'fuuu' -f cg.identifier.status -t correct -m 206
    -
    - -
  • This changes “Open Access” to “Unrestricted Access” and “Limited Access” to “Restricted Access”

  • - -
  • After that I did a full Discovery reindex:

    - +
    $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
     real    77m3.755s
     user    7m39.785s
     sys     2m18.485s
    -
  • - -
  • I told Peter it’s better to do the access rights before the usage rights because the git branches are conflicting with each other and it’s actually a pain in the ass to keep changing the values as we discuss, rebase, merge, fix conflicts…

  • - -
  • Udana and Mia from WLE were asking some questions about their WLE Feedburner feed

  • - -
  • It’s pretty confusing, because until recently they were entering issue dates as only YYYY (like 2018) and their feeds were all showing items in the wrong order

  • - -
  • I’m not exactly sure what their problem now is, though (confusing)

  • - -
  • I updated the dspace-statistics-api to use psycopg2's execute_values() to insert batches of 100 values into PostgreSQL instead of doing every insert individually (a rough sketch is below)

  • - -
  • On CGSpace this reduces the total run time of indexer.py from 432 seconds to 400 seconds (most of the time is actually spent in getting the data from Solr though)

  • + - -
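  • Roughly, the batched upsert looks something like this (a minimal sketch, assuming the same items table and local credentials used elsewhere in these notes; the real indexer.py differs in detail):

    $ python3
    >>> import psycopg2, psycopg2.extras
    >>> connection = psycopg2.connect('dbname=dspacestatistics user=dspacestatistics password=fuuu host=localhost')
    >>> cursor = connection.cursor()
    >>> data = [(0, 52, 171), (1, 78, 176)]
    >>> sql = 'INSERT INTO items(id, views, downloads) VALUES %s ON CONFLICT(id) DO UPDATE SET views=excluded.views, downloads=excluded.downloads'
    >>> psycopg2.extras.execute_values(cursor, sql, data, page_size=100)
    >>> connection.commit()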

    2018-09-27

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "26/Sep/2018:(19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
       295 34.218.226.147
       296 66.249.64.95
       350 157.55.39.185
       359 207.46.13.28
       371 157.55.39.85
       388 40.77.167.148
       444 66.249.64.93
       544 68.6.87.12
       834 66.249.64.91
       902 35.237.175.180
  • 35.237.175.180 is on Google Cloud

  • - -
  • 68.6.87.12 is on Cox Communications in the US (?)

  • - -
  • These hosts are not using proper user agents and are not re-using their Tomcat sessions:

    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-09-26 | sort | uniq
     5423
     $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=68.6.87.12' dspace.log.2018-09-26 | sort | uniq
     758
    -
  • - -
  • I will add their IPs to the list of bad bots in nginx so we can add a “bot” user agent to them and let Tomcat’s Crawler Session Manager Valve handle them

  • - -
  • I asked Atmire to prepare an invoice for 125 credits

  • + - -

    2018-09-29

    $ ./fix-metadata-values.py -i 2018-09-29-fix-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
     $ ./fix-metadata-values.py -i 2018-09-29-fix-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
    -
    - -
  • Afterwards I started a full Discovery re-index:

    - -
    $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
    -
  • - -
  • Linode sent an alert that both CGSpace and DSpace Test were using high CPU for the last two hours

  • - -
  • It seems to be Moayad trying to do the AReS explorer indexing

  • - -
  • He was sending too many (5 or 10) concurrent requests to the server, but still… why is this shit so slow?!

  • + - -

    2018-09-30

    dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2018-09-30-languages.csv with csv;
    -
    - -
  • Then I can simply delete the “Other” and “other” ones because that’s not useful at all:

    - +
    dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='Other';
     DELETE 6
     dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='other';
     DELETE 79
    -
  • - -
  • Looking through the list I see some weird language codes like gh, so I checked out those items:

    - +
    dspace=# SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='gh';
     resource_id
    -------------
           94530
           94529
    dspace=# SELECT handle,item_id FROM item, handle WHERE handle.resource_type_id=2 AND handle.resource_id = item.item_id AND handle.resource_id in (94530, 94529);
       handle    | item_id
    -------------+---------
     10568/91386 |   94529
     10568/91387 |   94530
  • - -
  • Those items are from Ghana, so the submitter apparently thought gh was a language… I can safely delete them:

    dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='gh';
     DELETE 2
    -
  • - -
  • The next issue would be jn:

    - +
    dspace=# SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='jn';
     resource_id
    -------------
           94001
           94003
    dspace=# SELECT handle,item_id FROM item, handle WHERE handle.resource_type_id=2 AND handle.resource_id = item.item_id AND handle.resource_id in (94001, 94003);
       handle    | item_id
    -------------+---------
     10568/90868 |   94001
     10568/90870 |   94003
  • - -
  • Those items are about Japan, so I will update them to be ja

  • - -
  • Other replacements:

    DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='gh';
     UPDATE metadatavalue SET text_value='fr' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='fn';
     UPDATE metadatavalue SET text_value='hi' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='in';
     UPDATE metadatavalue SET text_value='ja' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='Ja';
     UPDATE metadatavalue SET text_value='ja' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='jn';
     UPDATE metadatavalue SET text_value='ja' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='jp';
    -
  • - -
  • Then there are 12 items with en|hi, but they were all in one collection so I just exported it as a CSV and then re-imported the corrected metadata

  • + - - + diff --git a/docs/2018-10/index.html b/docs/2018-10/index.html index 503b82c9d..1b89de688 100644 --- a/docs/2018-10/index.html +++ b/docs/2018-10/index.html @@ -8,9 +8,8 @@ @@ -21,11 +20,10 @@ I created a GitHub issue to track this #389, because I’m super busy in Nai - + @@ -106,104 +104,86 @@ I created a GitHub issue to track this #389, because I’m super busy in Nai
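  • That round trip is basically just the standard metadata CSV tools, something like this (a sketch; the collection handle is a placeholder):

    $ dspace metadata-export -i 10568/12345 -f /tmp/en-hi-collection.csv
    $ dspace metadata-import -f /tmp/en-hi-collection.csv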

    -

    2018-10-01

    2018-10-03

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Oct/2018" | awk '{print $1}
     ' | sort | uniq -c | sort -n | tail -n 10
        933 40.77.167.90
        971 95.108.181.88
       1043 41.204.190.40
       1454 157.55.39.54
       1538 207.46.13.69
       1719 66.249.64.61
       2048 50.116.102.77
       4639 66.249.64.59
       4736 35.237.175.180
     150362 34.218.226.147
  • Of those, about 20% were HTTP 500 responses (!):

    $ zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Oct/2018" | grep 34.218.226.147 | awk '{print $9}' | sort -n | uniq -c
     118927 200
      31435 500
  • - -
  • I added Phil Thornton and Sonal Henson’s ORCID identifiers to the controlled vocabulary for cg.creator.orcid and then re-generated the names using my resolve-orcids.py script:

    $ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml | sort | uniq > 2018-10-03-orcids.txt
     $ ./resolve-orcids.py -i 2018-10-03-orcids.txt -o 2018-10-03-names.txt -d
    -
  • - -
  • I found a new corner case error that I need to check, given and family names deactivated:

    - +
    Looking up the names associated with ORCID iD: 0000-0001-7930-5752
     Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
    -
  • - -
  • It appears to be Jim Lorenzen… I need to check that later!

  • - -
  • I merged the changes to the 5_x-prod branch (#390)

  • - -
  • Linode sent another alert about CPU usage on CGSpace (linode18) this evening

  • - -
  • It seems that Moayad is making quite a lot of requests today:

    - +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Oct/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
       1594 157.55.39.160
       1627 157.55.39.173
       1774 136.243.6.84
       4228 35.237.175.180
       4497 70.32.83.92
       4856 66.249.64.59
       7120 50.116.102.77
      12518 138.201.49.199
      87646 34.218.226.147
     111729 213.139.53.62
  • - -
  • But in super positive news, he says they are using my new dspace-statistics-api and it’s MUCH faster than using Atmire CUA’s internal “restlet” API

  • - -
  • I don’t recognize the 138.201.49.199 IP, but it is in Germany (Hetzner) and appears to be paginating over some browse pages and downloading bitstreams:

    # grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E 'GET /[a-z]+' | sort | uniq -c
      8324 GET /bitstream
      4193 GET /handle
  • - -
  • Suspiciously, it’s only grabbing the CGIAR System Office community (handle prefix 10947):

    # grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E 'GET /handle/[0-9]{5}' | sort | uniq -c
         7 GET /handle/10568
      4186 GET /handle/10947
  • - -
  • The user agent is suspicious too:

    Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36
    -
  • - -
  • It’s clearly a bot and it’s not re-using its Tomcat session, so I will add its IP to the nginx bad bot list

  • - -
  • I looked in Solr’s statistics core and these hits were actually all counted as isBot:false (of course)… hmmm

  • - -
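  • For example, a query like this should show how many of that IP's hits were recorded with isBot:false (a sketch; rows=0 just returns the numFound count):

    $ http 'http://localhost:8081/solr/statistics/select?q=ip:138.201.49.199+AND+isBot:false&rows=0&wt=json&indent=true'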
  • I tagged all of Sonal and Phil’s items with their ORCID identifiers on CGSpace using my add-orcid-identifiers.py script:

    - +
    $ ./add-orcid-identifiers-csv.py -i 2018-10-03-add-orcids.csv -db dspace -u dspace -p 'fuuu'
    -
  • - -
  • Where 2018-10-03-add-orcids.csv contained:

    - +
    dc.contributor.author,cg.creator.id
     "Henson, Sonal P.",Sonal Henson: 0000-0002-2002-5462
     "Henson, S.",Sonal Henson: 0000-0002-2002-5462
    @@ -213,105 +193,75 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
     "Thornton, Philip K.",Philip Thornton: 0000-0002-1854-0182
     "Thornton, Phillip",Philip Thornton: 0000-0002-1854-0182
     "Thornton, Phillip K.",Philip Thornton: 0000-0002-1854-0182
    -
  • - - -

    2018-10-04

    +
  • I see there are other bundles we might need to pay attention to: TEXT, @_LOGO-COLLECTION_@, @_LOGO-COMMUNITY_@, etc…
  • On a hunch I dropped the statistics table and re-indexed and now those two items above have no downloads
  • So it's fixed, but I'm not sure why!

  • Peter wants to know the number of API requests per month, which was about 250,000 in September (excluding statlet requests):

    # zcat --force /var/log/nginx/{oai,rest}.log* | grep -E 'Sep/2018' | grep -c -v 'statlets'
     251226
    -
    - -
  • I found a logic error in the dspace-statistics-api indexer.py script that was causing item views to be inserted into downloads

  • - -
  • I tagged version 0.4.2 of the tool and redeployed it on CGSpace

  • + - -

    2018-10-05

    2018-10-06

    2018-10-08

    2018-10-10

    $ dspace filter-media -v -f -i 10568/97613
     org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: not authorized `/tmp/impdfthumb5039464037201498062.pdf' @ error/constitute.c/ReadImage/412.
    -
    - -
  • I see there was an update to Ubuntu’s ImageMagick on 2018-10-05, so maybe something changed or broke?

  • - -
  • I get the same error when forcing filter-media to run on DSpace Test too, so it's gotta be an ImageMagick bug

  • - -
  • The ImageMagick version is currently 8:6.8.9.9-7ubuntu5.13, and there is an Ubuntu Security Notice from 2018-10-04

  • - -
  • Wow, someone on Twitter posted about this breaking his web application (and it was retweeted by the ImageMagick account!)

  • - -
  • I commented out the line that disables PDF thumbnails in /etc/ImageMagick-6/policy.xml:

    - -
    <!--<policy domain="coder" rights="none" pattern="PDF" />-->
    -
  • - -
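  • One way to script that same change (for example from Ansible) would be a simple sed over the policy file, though I'd rather manage the whole file as a template (a sketch, assuming the line appears exactly as above):

    $ sudo sed -i 's|<policy domain="coder" rights="none" pattern="PDF" />|<!--<policy domain="coder" rights="none" pattern="PDF" />-->|' /etc/ImageMagick-6/policy.xml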
  • This works, but I’m not sure what ImageMagick’s long-term plan is if they are going to disable ALL image formats…

  • - -
  • I suppose I need to enable a workaround for this in Ansible?

  • + - -

    2018-10-11

    dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 57 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2018-10-11-top-1500-subject.csv WITH CSV HEADER;
     COPY 1500
    -
    - -
  • Give WorldFish advice about Handles because they are talking to some company called KnowledgeArc who recommends they do not use Handles!

  • - -
  • Last week I emailed Altmetric to ask if their software would notice mentions of our Handle in the format “handle:10568/80775” because I noticed that the Land Portal does this

  • - -
  • Altmetric support responded to say no, but the reason is that Land Portal is doing even more strange stuff by not using <meta> tags in their page header, and using “dct:identifier” property instead of “dc:identifier”

  • - -
  • I re-created my local DSpace database container using podman instead of Docker:

    - +
    $ mkdir -p ~/.local/lib/containers/volumes/dspacedb_data
     $ sudo podman create --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
     $ sudo podman start dspacedb
    @@ -321,106 +271,80 @@ $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
     $ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost ~/Downloads/cgspace_2018-10-11.backup
     $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
     $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
    -
  • - -
  • I tried to make an Artifactory in podman, but it seems to have problems because Artifactory is distributed on the Bintray repository

  • - -
  • I can pull the docker.bintray.io/jfrog/artifactory-oss:latest image, but not start it

  • - -
  • I decided to use a Sonatype Nexus repository instead:

    - +
    $ mkdir -p ~/.local/lib/containers/volumes/nexus_data
     $ sudo podman run --name nexus -d -v /home/aorth/.local/lib/containers/volumes/nexus_data:/nexus_data -p 8081:8081 sonatype/nexus3
    -
  • - -
  • With a few changes to my local Maven settings.xml it is working well

  • - -
  • Generate a list of the top 10,000 authors for Peter Ballantyne to look through:

    - +
    dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 3 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 10000) to /tmp/2018-10-11-top-10000-authors.csv WITH CSV HEADER;
     COPY 10000
    -
  • - -
  • CTA uploaded some infographics that are very tall and their thumbnails disrupt the item lists on the front page and in their communities and collections

  • - -
  • I decided to constrain the max height of these to 200px using CSS (#392)

  • + - -

    2018-10-13

    - +

    2018-10-13

    - -

    2018-10-14

    - +
    or(
    +  isNotNull(value.match(/.*\uFFFD.*/)),
    +  isNotNull(value.match(/.*\u00A0.*/)),
    +  isNotNull(value.match(/.*\u200A.*/)),
    +  isNotNull(value.match(/.*\u2019.*/)),
    +  isNotNull(value.match(/.*\u00b4.*/))
    +)
    +
    +
    $ ./fix-metadata-values.py -i 2018-10-11-top-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t CORRECT -m 3
    +
    +

    2018-10-14

    $ ./fix-metadata-values.py -i /tmp/2018-10-11-top-authors.csv -f dc.contributor.author -t CORRECT -m 3 -db dspace -u dspace -p 'fuuu'
    -
    - -
  • Run all system updates on CGSpace (linode19) and reboot the server

  • - -
  • After rebooting the server I noticed that Handles are not resolving, and the dspace-handle-server systemd service is not running (or rather, it exited with success)

  • - -
  • Restarting the service with systemd works for a few seconds, then the java process quits

  • - -
  • I suspect that the systemd service type needs to be forking rather than simple, because the service calls the default DSpace start-handle-server shell script, which uses nohup and & to background the java process

  • - -
  • It would be nice if there was a cleaner way to start the service and then just log to the systemd journal rather than all this hiding and log redirecting

  • - -
  • Email the Landportal.org people to ask if they would consider Dublin Core metadata tags in their page’s header, rather than the HTML properties they are using in their body

  • - -
  • Peter pointed out that some thumbnails were still not getting generated

    - + - -

    2018-10-15

    - +
  • +
  • I limited the tall thumbnails even further to 170px because Peter said CTA's were still too tall at 200px (#396)
  • + +

    2018-10-15

    + +
  • I ended up having some issues with podman and went back to Docker, so I had to re-create my containers:
  • +
    $ sudo docker run --name nexus --network dspace-build -d -v /home/aorth/.local/lib/containers/volumes/nexus_data:/nexus_data -p 8081:8081 sonatype/nexus3
     $ sudo docker run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
     $ createuser -h localhost -U postgres --pwprompt dspacetest
    @@ -429,21 +353,15 @@ $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
     $ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost ~/Downloads/cgspace_2018-10-11.backup
     $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
     $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
    -
    - - -

    2018-10-16

    - +

    2018-10-16

    dspace=# \copy (SELECT (CASE when metadata_schema_id=1 THEN 'dc' WHEN metadata_schema_id=2 THEN 'cg' END) AS schema, element, qualifier, scope_note FROM metadatafieldregistry where metadata_schema_id IN (1,2)) TO /tmp/cgspace-schema.csv WITH CSV HEADER;
    -
    - -
  • Talking to the CodeObia guys about the REST API I started to wonder why it’s so slow and how I can quantify it in order to ask the dspace-tech mailing list for help profiling it

  • - -
  • Interestingly, the speed doesn’t get better after you request the same thing multiple times–it’s consistently bad on both CGSpace and DSpace Test!

    - +
    $ time http --print h 'https://cgspace.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0'
     ...
     0.35s user 0.06s system 1% cpu 25.133 total
    @@ -459,14 +377,11 @@ $ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,b
     0.23s user 0.04s system 1% cpu 16.460 total
     0.24s user 0.04s system 1% cpu 21.043 total
     0.22s user 0.04s system 1% cpu 17.132 total
    -
  • - -
  • I should note that at this time CGSpace is using Oracle Java and DSpace Test is using OpenJDK (both version 8)

  • - -
  • I wonder if the Java garbage collector is important here, or if there are missing indexes in PostgreSQL?

  • - -
  • I switched DSpace Test to the G1GC garbage collector and tried again and now the results are worse!

    - +
    $ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0'
     ...
     0.20s user 0.03s system 0% cpu 25.017 total
    @@ -474,29 +389,24 @@ $ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,b
     0.24s user 0.02s system 1% cpu 22.496 total
     0.22s user 0.03s system 1% cpu 22.720 total
     0.23s user 0.03s system 1% cpu 22.632 total
    -
  • - -
  • If I make a request without the expands it is ten times faster:

    - +
    $ time http --print h 'https://dspacetest.cgiar.org/rest/items?limit=100&offset=0'
     ...
     0.20s user 0.03s system 7% cpu 3.098 total
     0.22s user 0.03s system 8% cpu 2.896 total
     0.21s user 0.05s system 9% cpu 2.787 total
     0.23s user 0.02s system 8% cpu 2.896 total
    -
  • - -
  • I sent a mail to dspace-tech to ask how to profile this…

  • + - -

    2018-10-17

    - +

    2018-10-17

    UPDATE metadatavalue SET text_value='CC-BY-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%CC BY %';
     UPDATE metadatavalue SET text_value='CC-BY-NC-ND-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%4.0%' AND text_value LIKE '%BY-NC-ND%' AND text_value LIKE '%by-nc-nd%';
     UPDATE metadatavalue SET text_value='CC-BY-NC-SA-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%4.0%' AND text_value LIKE '%BY-NC-SA%' AND text_value LIKE '%by-nc-sa%';
    @@ -513,115 +423,89 @@ UPDATE metadatavalue SET text_value='CC-BY-3.0' WHERE resource_type_id=2 AND met
     UPDATE metadatavalue SET text_value='CC-BY-ND-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND resource_id=78184;
     UPDATE metadatavalue SET text_value='CC-BY' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value NOT LIKE '%zero%' AND text_value NOT LIKE '%CC0%' AND text_value LIKE '%Attribution %' AND text_value NOT LIKE '%CC-%';
     UPDATE metadatavalue SET text_value='CC-BY-NC-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND resource_id=78564;
  • I updated the fields on CGSpace and then started a re-index of Discovery
  • We also need to re-think the dc.rights field in the submission form: we should probably use a popup controlled vocabulary and list the Creative Commons values with version numbers and allow the user to enter their own (like the ORCID identifier field)
  • Ask Jane if we can use some of the BDP money to host AReS explorer on a more powerful server
  • IWMI sent me a list of new ORCID identifiers for their staff so I combined them with our list, updated the names with my resolve-orcids.py script, and regenerated the controlled vocabulary:

    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml MEL\ ORCID.json MEL\ ORCID_V2.json 2018-10-17-IWMI-ORCIDs.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq >
     2018-10-17-orcids.txt
     $ ./resolve-orcids.py -i 2018-10-17-orcids.txt -o 2018-10-17-names.txt -d
     $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
  • I also decided to add the ORCID identifiers that MEL had sent us a few months ago…
  • One problem I had with the resolve-orcids.py script is that one user seems to have disabled their profile data since we last updated:

    Looking up the names associated with ORCID iD: 0000-0001-7930-5752
     Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
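  • A quick way to spot any such entries before committing the regenerated files (a sketch; whether the “Deactivated” placeholder actually lands in the names file is an assumption):

$ grep -i deactivated 2018-10-17-names.txt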
  • So I need to handle that situation in the script for sure, but I’m not sure what to do organizationally or ethically, since that user disabled their name! Do we remove them from the list?
  • I made a pull request and merged the ORCID updates into the 5_x-prod branch (#397)
  • Improve the logic of name checking in my resolve-orcids.py script

    2018-10-18

    # su - postgres
     $ /usr/lib/postgresql/9.6/bin/pg_upgrade -b /usr/lib/postgresql/9.5/bin -B /usr/lib/postgresql/9.6/bin -d /var/lib/postgresql/9.5/main -D /var/lib/postgresql/9.6/main -o ' -c config_file=/etc/postgresql/9.5/main/postgresql.conf' -O ' -c config_file=/etc/postgresql/9.6/main/postgresql.conf'
     $ exit
     # systemctl start postgresql
     # dpkg -r postgresql-9.5 postgresql-client-9.5 postgresql-contrib-9.5

2018-10-19

# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "19/Oct/2018:(12|13|14|15)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    361 207.46.13.179
    395 181.115.248.74
    485 66.249.64.93
    535 157.55.39.213
    536 157.55.39.99
    551 34.218.226.147
    580 157.55.39.173
   1516 35.237.175.180
   1629 66.249.64.91
   1758 5.9.6.51

    2018-10-20

    $ sudo docker pull solr:5
     $ sudo docker run --name my_solr -v ~/dspace/solr/statistics/conf:/tmp/conf -d -p 8983:8983 -t solr:5
     $ sudo docker logs my_solr
     ...
     ERROR: Error CREATEing SolrCore 'statistics': Unable to create core [statistics] Caused by: solr.IntField
  • Apparently a bunch of variable types were removed in Solr 5
  • So for now it’s actually a huge pain in the ass to run the tests for my dspace-statistics-api
  • Linode sent a message that the CPU usage was high on CGSpace (linode18) last night
  • According to the nginx logs around that time it was 5.9.6.51 (MegaIndex) again:

# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "20/Oct/2018:(14|15|16)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    249 207.46.13.179
    250 157.55.39.173
    301 54.166.207.223
    303 157.55.39.213
    310 66.249.64.95
    362 34.218.226.147
    381 66.249.64.93
    415 35.237.175.180
   1205 66.249.64.91
   1227 5.9.6.51

  • This bot is only using the XMLUI and it does not seem to be re-using its sessions:

    # grep -c 5.9.6.51 /var/log/nginx/*.log
     /var/log/nginx/access.log:9323
     /var/log/nginx/error.log:0
    @@ -631,69 +515,51 @@ ERROR: Error CREATEing SolrCore 'statistics': Unable to create core [statistics]
     /var/log/nginx/statistics.log:0
     # grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-10-20 | sort | uniq
     8915
  • Last month I added “crawl” to the Tomcat Crawler Session Manager Valve’s regular expression matching, and it seems to be working for MegaIndex’s user agent:

$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1' User-Agent:'"Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)"'

  • So I’m not sure why this bot uses so many sessions — is it because it requests very slowly?

    2018-10-21

    2018-10-22

    dspace=# UPDATE metadatavalue SET text_value=replace(text_value, 'http://', 'https://') WHERE resource_type_id=2 AND text_value LIKE 'http://hdl.handle.net%';
  • While I was doing that I found two items using CGSpace URLs instead of handles in their dc.identifier.uri so I corrected those
  • I also found several items that had invalid characters or multiple Handles in some related URL field like cg.link.reference so I corrected those too
  • Improve the usage rights on the submission form by adding a default selection with no value as well as a better hint to look for the CC license on the publisher page or in the PDF (#398)
  • I deployed the changes on CGSpace, ran all system updates, and rebooted the server
  • Also, I updated all Handles in the database to use HTTPS:

    dspace=# UPDATE metadatavalue SET text_value=replace(text_value, 'http://', 'https://') WHERE resource_type_id=2 AND text_value LIKE 'http://hdl.handle.net%';
     UPDATE 76608
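  • Related to the URL cleanup above, one way to spot values that still contain more than one Handle is a double LIKE match (a sketch; it scans every metadata field, not just cg.link.reference):

dspace=# SELECT resource_id, text_value FROM metadatavalue WHERE resource_type_id=2 AND text_value LIKE '%hdl.handle.net%hdl.handle.net%';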
  • Skype with Peter about ToRs for the AReS open source work and future plans to develop tools around the DSpace ecosystem
  • Help CGSpace users with some issues related to usage rights

    2018-10-23

    $ http --print b POST 'https://dspacetest.cgiar.org/rest/login' email='testdeposit@cgiar.org' password=deposit
     acef8a4a-41f3-4392-b870-e873790f696b
     
     $ http POST 'https://dspacetest.cgiar.org/rest/logout' rest-dspace-token:acef8a4a-41f3-4392-b870-e873790f696b
  • Also works via curl (login, check status, logout, check status):

    $ curl -H "Content-Type: application/json" --data '{"email":"testdeposit@cgiar.org", "password":"deposit"}' https://dspacetest.cgiar.org/rest/login
     e09fb5e1-72b0-4811-a2e5-5c1cd78293cc
     $ curl -X GET -H "Content-Type: application/json" -H "Accept: application/json" -H "rest-dspace-token: e09fb5e1-72b0-4811-a2e5-5c1cd78293cc" https://dspacetest.cgiar.org/rest/status
    @@ -701,28 +567,21 @@ $ curl -X GET -H "Content-Type: application/json" -H "Accept: app
     $ curl -X POST -H "Content-Type: application/json" -H "rest-dspace-token: e09fb5e1-72b0-4811-a2e5-5c1cd78293cc" https://dspacetest.cgiar.org/rest/logout
     $ curl -X GET -H "Content-Type: application/json" -H "Accept: application/json" -H "rest-dspace-token: e09fb5e1-72b0-4811-a2e5-5c1cd78293cc" https://dspacetest.cgiar.org/rest/status
     {"okay":true,"authenticated":false,"email":null,"fullname":null,"token":null}%
  • Improve the documentation of my dspace-statistics-api
  • Email Modi and Jayashree from ICRISAT to ask if they want to join CGSpace as partners

    2018-10-24

    2018-10-25

  • Maria asked if we can add publisher (dc.publisher) to the advanced search filters, so I created a GitHub issue to track it

    2018-10-28

    2018-10-29

    2018-10-30

    2018-10-31

    - - + diff --git a/docs/2018-11/index.html b/docs/2018-11/index.html index cb5e55265..63c12dbba 100644 --- a/docs/2018-11/index.html +++ b/docs/2018-11/index.html @@ -8,14 +8,11 @@ @@ -28,18 +25,15 @@ Today these are the top 10 IPs: - + @@ -120,20 +114,16 @@ Today these are the top 10 IPs:

    2018-11-01

    2018-11-03

    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Nov/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
        1300 66.249.64.63
        1384 35.237.175.180
    @@ -145,239 +135,195 @@ Today these are the top 10 IPs:
        3367 84.38.130.177
        4537 70.32.83.92
       22508 66.249.64.59

    2018-11-11

    2018-11-13

    2018-11-14

    2018-11-15

    2018-11-18

    2018-11-19

    $ ./fix-metadata-values.py -i 2018-11-19-correct-agrovoc.csv -f dc.subject -t correct -m 57 -db dspace -u dspace -p 'fuu' -d
     $ ./delete-metadata-values.py -i 2018-11-19-delete-agrovoc.csv -f dc.subject -m 57 -db dspace -u dspace -p 'fuu' -d
  • Then I ran them on both CGSpace and DSpace Test, and started a full Discovery re-index on CGSpace:

$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b

  • Generate a new list of the top 1500 AGROVOC subjects on CGSpace to send to Peter and Sisay:

dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 57 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2018-11-19-top-1500-subject.csv WITH CSV HEADER;

    2018-11-20

    2018-11-19 15:23:04,221 ERROR com.atmire.dspace.discovery.AtmireSolrService @ DSpace kernel cannot be null
     java.lang.IllegalStateException: DSpace kernel cannot be null
        at org.dspace.utils.DSpace.getServiceManager(DSpace.java:63)
        at org.dspace.utils.DSpace.getSingletonService(DSpace.java:87)
        at com.atmire.dspace.discovery.AtmireSolrService.buildDocument(AtmireSolrService.java:102)
        at com.atmire.dspace.discovery.AtmireSolrService.indexContent(AtmireSolrService.java:815)
        at com.atmire.dspace.discovery.AtmireSolrService.updateIndex(AtmireSolrService.java:884)
        at org.dspace.discovery.SolrServiceImpl.createIndex(SolrServiceImpl.java:370)
        at org.dspace.discovery.IndexClient.main(IndexClient.java:117)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
     2018-11-19 15:23:04,223 INFO  com.atmire.dspace.discovery.AtmireSolrService @ Processing (4629 of 76007): 72731
  • I looked in the Solr log around that time and I don’t see anything…
  • Working on Udana’s WLE records from last month, first the sixteen records in 2018-11-20 RDL Temp
  • Then the 24 records in 2018-11-20 VRC Temp
  • I notice a few items using DOIs pointing at ICARDA's DSpace like: https://doi.org/20.500.11766/8178, which then points at the “real” DOI on the publisher's site… these should be using the real DOI instead of ICARDA's “fake” Handle DOI (see the query sketch just below)
  • Some items missing DOIs, but they clearly have them if you look at the publisher's site
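  • A rough query for finding the ICARDA-style Handle DOIs mentioned above (a sketch; it matches the 20.500.11766 prefix in any metadata field rather than only the DOI field):

dspace=# SELECT resource_id, text_value FROM metadatavalue WHERE resource_type_id=2 AND text_value LIKE '%doi.org/20.500.11766%';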
    2018-11-22

    2018-11-26

    $ dspace index-discovery -r 10568/41888
     $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery
  • … but the item still doesn’t appear in the collection
  • Now I will try a full Discovery re-index:

$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b

  • Ah, Marianne had set the item as private when she uploaded it, so it was still private
  • I made it public and now it shows up in the collection list
  • More work on the AReS terms of reference for CodeObia
  • Erica from AgriKnowledge emailed me to say that they have implemented the changes in their item page UI so that they include the permanent identifier on items harvested from CGSpace, for example: https://www.agriknowledge.org/concern/generics/wd375w33s

    2018-11-27

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "27/Nov/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    229 46.101.86.248
    261 66.249.64.61
    447 66.249.64.59
    541 207.46.13.77
    548 40.77.167.97
    564 35.237.175.180
    595 40.77.167.135
    611 157.55.39.91
   4564 205.186.128.185
   4564 70.32.83.92

  • We know 70.32.83.92 is the CCAFS harvester on MediaTemple, but 205.186.128.185 appears to be a new CCAFS harvester
  • I think we might want to prune some old accounts from CGSpace, perhaps users who haven’t logged in in the last two years would be a conservative bunch:

    $ dspace dsrun org.dspace.eperson.Groomer -a -b 11/27/2016 | wc -l
     409
     $ dspace dsrun org.dspace.eperson.Groomer -a -b 11/27/2016 -d
    -
  • - -
  • This deleted about 380 users, skipping those who have submissions in the repository

  • - -
  • Judy Kimani was having problems taking tasks in the ILRI project reports, papers and documents collection again

    - + - -

    2018-11-28

    - +
  • +
  • Help Marianne troubleshoot some issues with items in their WLE collections and the WLE publications website
  • + +

    2018-11-28

    - - + diff --git a/docs/2018-12/index.html b/docs/2018-12/index.html index fc04603e0..8eb854092 100644 --- a/docs/2018-12/index.html +++ b/docs/2018-12/index.html @@ -8,15 +8,12 @@ @@ -28,18 +25,15 @@ I noticed that there is another issue with PDF thumbnails on CGSpace, and I see - + @@ -120,61 +114,51 @@ I noticed that there is another issue with PDF thumbnails on CGSpace, and I see

    2018-12-01

    2018-12-02

    -
    org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-12989PcFN0DnJOej7%d" "-f/tmp/magick-129895Bmp44lvUfxo" "-f/tmp/magick-12989C0QFG51fktLF"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
     org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-12989PcFN0DnJOej7%d" "-f/tmp/magick-129895Bmp44lvUfxo" "-f/tmp/magick-12989C0QFG51fktLF"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
    -    at org.im4java.core.Info.getBaseInfo(Info.java:360)
    -    at org.im4java.core.Info.<init>(Info.java:151)
    -    at org.dspace.app.mediafilter.ImageMagickThumbnailFilter.getImageFile(ImageMagickThumbnailFilter.java:142)
    -    at org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter.getDestinationStream(ImageMagickPdfThumbnailFilter.java:24)
    -    at org.dspace.app.mediafilter.FormatFilter.processBitstream(FormatFilter.java:170)
    -    at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:475)
    -    at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:429)
    -    at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:401)
    -    at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:237)
    -    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    -    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    -    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    -    at java.lang.reflect.Method.invoke(Method.java:498)
    -    at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
    -    at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
  • A comment on a StackOverflow question from yesterday suggests it might be a bug with the pngalpha device in Ghostscript and links to an upstream bug
  • I think we need to wait for a fix from Ubuntu
  • For what it’s worth, I get the same error on my local Arch Linux environment with Ghostscript 9.26:

    $ gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 -sDEVICE=pngalpha -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r72x72 -dFirstPage=1 -dLastPage=1 -sOutputFile=/tmp/out%d -f/home/aorth/Desktop/Food\ safety\ Kenya\ fruits.pdf
     DEBUG: FC_WEIGHT didn't match
     zsh: segmentation fault (core dumped)  gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000
  • When I replace the pngalpha device with png16m as suggested in the StackOverflow comments it works:

    $ gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 -sDEVICE=png16m -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r72x72 -dFirstPage=1 -dLastPage=1 -sOutputFile=/tmp/out%d -f/home/aorth/Desktop/Food\ safety\ Kenya\ fruits.pdf
     DEBUG: FC_WEIGHT didn't match
  • Start proofing the latest round of 226 IITA archive records that Bosede sent last week and Sisay uploaded to DSpace Test this weekend (IITA_Dec_1_1997 aka Daniel1807)
  • Expand my “encoding error” detection GREL to include ~ as I saw a lot of that in some copy pasted French text recently:

or(
  isNotNull(value.match(/.*\uFFFD.*/)),
  isNotNull(value.match(/.*\u00A0.*/)),
  isNotNull(value.match(/.*\u200A.*/)),
  isNotNull(value.match(/.*\u2019.*/)),
  isNotNull(value.match(/.*\u00b4.*/)),
  isNotNull(value.match(/.*\u007e.*/))
)

    2018-12-03

    $ gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 -sDEVICE=pngalpha -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r72x72 -dFirstPage=1 -dLastPage=1 -sOutputFile=/tmp/out%d -f/home/aorth/Desktop/Info\ Note\ Mainstreaming\ gender\ and\ social\ differentiation\ into\ CCAFS\ research\ activities\ in\ West\ Africa-converted.pdf
  • So it seems to be something about the PDFs themselves, perhaps related to alpha support?
  • The first item (1056898394) has the following information:

    $ identify Info\ Note\ Mainstreaming\ gender\ and\ social\ differentiation\ into\ CCAFS\ research\ activities\ in\ West\ Africa-converted.pdf\[0\]
     Info Note Mainstreaming gender and social differentiation into CCAFS research activities in West Africa-converted.pdf[0]=>Info Note Mainstreaming gender and social differentiation into CCAFS research activities in West Africa-converted.pdf PDF 595x841 595x841+0+0 16-bit sRGB 107443B 0.000u 0:00.000
     identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/1746.
  • And wow, I can’t even run ImageMagick’s identify on the first page of the second item (1056898930):

    $ identify Food\ safety\ Kenya\ fruits.pdf\[0\]
     zsh: abort (core dumped)  identify Food\ safety\ Kenya\ fruits.pdf\[0\]
  • But with GraphicsMagick’s identify it works:

    $ gm identify Food\ safety\ Kenya\ fruits.pdf\[0\]
     DEBUG: FC_WEIGHT didn't match
     Food safety Kenya fruits.pdf PDF 612x792+0+0 DirectClass 8-bit 1.4Mi 0.000u 0m:0.000002s
  • Interesting that ImageMagick’s identify does work if you do not specify a page, perhaps as alluded to in the recent Ghostscript bug report:

    $ identify Food\ safety\ Kenya\ fruits.pdf
     Food safety Kenya fruits.pdf[0] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010u 0:00.009
     Food safety Kenya fruits.pdf[1] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010u 0:00.009
    @@ -243,311 +217,258 @@ Food safety Kenya fruits.pdf[2] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010
     Food safety Kenya fruits.pdf[3] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010u 0:00.009
     Food safety Kenya fruits.pdf[4] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010u 0:00.009
     identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/1746.
  • As I expected, ImageMagick cannot generate a thumbnail, but GraphicsMagick can (though it looks like crap):

    $ convert Food\ safety\ Kenya\ fruits.pdf\[0\] -thumbnail 600x600 -flatten Food\ safety\ Kenya\ fruits.pdf.jpg
     zsh: abort (core dumped)  convert Food\ safety\ Kenya\ fruits.pdf\[0\] -thumbnail 600x600 -flatten
     $ gm convert Food\ safety\ Kenya\ fruits.pdf\[0\] -thumbnail 600x600 -flatten Food\ safety\ Kenya\ fruits.pdf.jpg
     DEBUG: FC_WEIGHT didn't match
  • I inspected the troublesome PDF using jhove and noticed that it is using ISO PDF/A-1, Level B and the other one doesn’t list a profile, though I don’t think this is relevant
  • I found another item that fails when generating a thumbnail (1056898391); DSpace complains:

    org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d" "-f/tmp/magick-14296Q0rJjfCeIj3w" "-f/tmp/magick-14296k_K6MWqwvpDm"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
     org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d" "-f/tmp/magick-14296Q0rJjfCeIj3w" "-f/tmp/magick-14296k_K6MWqwvpDm"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
    -    at org.im4java.core.Info.getBaseInfo(Info.java:360)
    -    at org.im4java.core.Info.<init>(Info.java:151)
    -    at org.dspace.app.mediafilter.ImageMagickThumbnailFilter.getImageFile(ImageMagickThumbnailFilter.java:142)
    -    at org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter.getDestinationStream(ImageMagickPdfThumbnailFilter.java:24)
    -    at org.dspace.app.mediafilter.FormatFilter.processBitstream(FormatFilter.java:170)
    -    at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:475)
    -    at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:429)
    -    at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:401)
    -    at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:237)
    -    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    -    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    -    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    -    at java.lang.reflect.Method.invoke(Method.java:498)
    -    at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
    -    at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
    +        at org.im4java.core.Info.getBaseInfo(Info.java:360)
    +        at org.im4java.core.Info.<init>(Info.java:151)
    +        at org.dspace.app.mediafilter.ImageMagickThumbnailFilter.getImageFile(ImageMagickThumbnailFilter.java:142)
    +        at org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter.getDestinationStream(ImageMagickPdfThumbnailFilter.java:24)
    +        at org.dspace.app.mediafilter.FormatFilter.processBitstream(FormatFilter.java:170)
    +        at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:475)
    +        at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:429)
    +        at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:401)
    +        at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:237)
    +        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    +        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    +        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    +        at java.lang.reflect.Method.invoke(Method.java:498)
    +        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
    +        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
     Caused by: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d" "-f/tmp/magick-14296Q0rJjfCeIj3w" "-f/tmp/magick-14296k_K6MWqwvpDm"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
    -    at org.im4java.core.ImageCommand.run(ImageCommand.java:219)
    -    at org.im4java.core.Info.getBaseInfo(Info.java:342)
    -    ... 14 more
    +        at org.im4java.core.ImageCommand.run(ImageCommand.java:219)
    +        at org.im4java.core.Info.getBaseInfo(Info.java:342)
    +        ... 14 more
     Caused by: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d" "-f/tmp/magick-14296Q0rJjfCeIj3w" "-f/tmp/magick-14296k_K6MWqwvpDm"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
    -    at org.im4java.core.ImageCommand.finished(ImageCommand.java:253)
    -    at org.im4java.process.ProcessStarter.run(ProcessStarter.java:314)
    -    at org.im4java.core.ImageCommand.run(ImageCommand.java:215)
    -    ... 15 more
  • And on my Arch Linux environment ImageMagick’s convert also segfaults:

    $ convert bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf\[0\] -thumbnail x600 -flatten bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf.jpg
     zsh: abort (core dumped)  convert bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf\[0\]  x60
  • But GraphicsMagick’s convert works:

    $ gm convert bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf\[0\] -thumbnail x600 -flatten bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf.jpg
  • So far the only thing that stands out is that the two files that don’t work were created with Microsoft Office 2016:

    $ pdfinfo bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf | grep -E '^(Creator|Producer)'
     Creator:        Microsoft® Word 2016
     Producer:       Microsoft® Word 2016
     $ pdfinfo Food\ safety\ Kenya\ fruits.pdf | grep -E '^(Creator|Producer)'
     Creator:        Microsoft® Word 2016
     Producer:       Microsoft® Word 2016
  • And the one that works was created with Office 365:

    $ pdfinfo Info\ Note\ Mainstreaming\ gender\ and\ social\ differentiation\ into\ CCAFS\ research\ activities\ in\ West\ Africa-converted.pdf | grep -E '^(Creator|Producer)'
     Creator:        Microsoft® Word for Office 365
     Producer:       Microsoft® Word for Office 365
  • I remembered an old technique I was using to generate thumbnails in 2015 using Inkscape followed by ImageMagick or GraphicsMagick:

    $ inkscape Food\ safety\ Kenya\ fruits.pdf -z --export-dpi=72 --export-area-drawing --export-png='cover.png'
     $ gm convert -resize x600 -flatten -quality 85 cover.png cover.jpg
  • I’ve tried a few times this week to register for the Ethiopian eVisa website, but it is never successful
  • In the end I tried one last time to just apply without registering and it was apparently successful
  • I tested DSpace 5.8 (5_x-prod branch) in an Ubuntu 18.04 VM with Tomcat 8.5 and had some issues:

    2018-12-03 15:44:00,030 WARN  org.dspace.core.ConfigurationManager @ Requested configuration module: atmire-datatables not found
     2018-12-03 15:44:03,390 ERROR com.atmire.app.webui.servlet.ExportServlet @ Error converter plugin not found: interface org.infoCon.ConverterPlugin
     ...
     2018-12-03 15:45:01,667 WARN  org.dspace.core.ConfigurationManager @ Requested configuration module: atmire-listing-and-reports not found
    -
  • - - -
  • I tested it on my local environment with Tomcat 8.5.34 and the JSPUI application still has an error (again, the logs show something about tag cloud, so it may be unrelated), and the Listings and Reports still asks you to log in again, despite already being logged in in XMLUI, but it does appear to work (I generated a report and exported a PDF)
  • I think the errors about missing Atmire components must be important, here on my local machine as well (though not the one about atmire-listings-and-reports):

2018-12-03 16:44:00,009 WARN  org.dspace.core.ConfigurationManager @ Requested configuration module: atmire-datatables not found

  • This has got to be partly Ubuntu’s Tomcat packaging, and partly DSpace 5.x’s Tomcat 8.5 readiness…?
  • + - -

    2018-12-04

    - +
    2018-12-03 16:44:00,009 WARN  org.dspace.core.ConfigurationManager @ Requested configuration module: atmire-datatables not found
    +
    +

    2018-12-04

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Dec/2018:1(5|6|7|8)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    225 40.77.167.142
    226 66.249.64.63
    232 46.101.86.248
    285 45.5.186.2
    333 54.70.40.11
    411 193.29.13.85
    476 34.218.226.147
    962 66.249.70.27
   1193 35.237.175.180
   1450 2a01:4f8:140:3192::2
     # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Dec/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
   1141 207.46.13.57
   1299 197.210.168.174
   1341 54.70.40.11
   1429 40.77.167.142
   1528 34.218.226.147
   1973 66.249.70.27
   2079 50.116.102.77
   2494 78.46.79.71
   3210 2a01:4f8:140:3192::2
   4190 35.237.175.180

  • 35.237.175.180 is known to us (CCAFS?), and I’ve already added it to the list of bot IPs in nginx, which appears to be working:

    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-12-03
     4772
     $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-12-03 | sort | uniq | wc -l
     630
  • I haven’t seen 2a01:4f8:140:3192::2 before. Its user agent is some new bot:

    Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
  • At least it seems the Tomcat Crawler Session Manager Valve is working to re-use the common bot XMLUI sessions:

    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2' dspace.log.2018-12-03
     5111
     $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2' dspace.log.2018-12-03 | sort | uniq | wc -l
     419
  • 78.46.79.71 is another host on Hetzner with the following user agent:

    Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
  • This is not the first time a host on Hetzner has used a “normal” user agent to make thousands of requests
  • At least it is re-using its Tomcat sessions somehow:

    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03
     2044
     $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03 | sort | uniq | wc -l
     1
    -
  • - -
  • In other news, it’s good to see my re-work of the database connectivity in the dspace-statistics-api actually caused a reduction of persistent database connections (from 1 to 0, but still!):
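  • A quick way to eyeball that from the command line, in the same spirit as the Munin graph (a sketch; the dspacestatistics database name is an assumption):

$ psql -c 'SELECT * FROM pg_stat_activity;' | grep -c dspacestatistics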

  • + - -

    PostgreSQL connections day

    - -

    2018-12-05

    - +

    PostgreSQL connections day

    +

    2018-12-05

    - -

    2018-12-06

    - +

    2018-12-06

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "05/Dec/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
   1225 157.55.39.177
   1240 207.46.13.12
   1261 207.46.13.101
   1411 207.46.13.157
   1529 34.218.226.147
   2085 50.116.102.77
   3334 2a01:7e00::f03c:91ff:fe0a:d645
   3733 66.249.70.27
   3815 35.237.175.180
   7669 54.70.40.11

  • 54.70.40.11 is some new bot with the following user agent:

    Mozilla/5.0 (compatible) SemanticScholarBot (+https://www.semanticscholar.org/crawler)
  • But Tomcat is forcing them to re-use their Tomcat sessions with the Crawler Session Manager valve:

    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11' dspace.log.2018-12-05
     6980
     $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11' dspace.log.2018-12-05 | sort | uniq | wc -l
     1156
    -
  • - -
  • 2a01:7e00::f03c:91ff:fe0a:d645 appears to be the CKM dev server where Danny is testing harvesting via Drupal
  • It seems they are hitting the XMLUI’s OpenSearch a bit, but mostly on the REST API so no issues here yet
  • Drupal is already in the Tomcat Crawler Session Manager Valve’s regex so that’s good!

    2018-12-10

    2018-12-11

    2018-12-13

    2018-12-17

    - -

# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "17/Dec/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    927 157.55.39.81
    975 54.70.40.11
   2090 50.116.102.77
   2121 66.249.66.219
   3811 35.237.175.180
   4590 205.186.128.185
   4590 70.32.83.92
   5436 2a01:4f8:173:1e85::2
   5438 143.233.227.216
   6706 94.71.244.172

Mozilla/3.0 (compatible; Indy Library)

    2018-12-18

    2018-12-19

    2018-12-20

    $ time xz -c cgspace_2018-12-19.backup > cgspace_2018-12-19.backup.xz
     xz -c cgspace_2018-12-19.backup > cgspace_2018-12-19.backup.xz  48.29s user 0.19s system 99% cpu 48.579 total
     $ time gzip -c cgspace_2018-12-19.backup > cgspace_2018-12-19.backup.gz
    @@ -556,41 +477,32 @@ $ ls -lh cgspace_2018-12-19.backup*
     -rw-r--r-- 1 aorth aorth 96M Dec 19 02:15 cgspace_2018-12-19.backup
     -rw-r--r-- 1 aorth aorth 94M Dec 20 11:36 cgspace_2018-12-19.backup.gz
     -rw-r--r-- 1 aorth aorth 93M Dec 20 11:35 cgspace_2018-12-19.backup.xz
  • Looks like it’s really not worth it…
  • Peter pointed out that Discovery filters for CTA subjects on item pages were not working
  • It looks like there were some mismatches in the Discovery index names and the XMLUI configuration, so I fixed them (#406)
  • Peter asked if we could create a controlled vocabulary for publishers (dc.publisher)
  • I see we have about 3500 distinct publishers:

    # SELECT COUNT(DISTINCT(text_value)) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=39;
 count
-------
  3522
     (1 row)
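  • If we do go ahead with that, a starting list could be exported the same way as the AGROVOC subjects (a sketch; the output path is arbitrary and metadata_field_id=39 is taken from the count query above):

dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=39 GROUP BY text_value ORDER BY count DESC) to /tmp/2018-12-20-publishers.csv WITH CSV HEADER;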
  • I reverted the metadata changes related to “Unrestricted Access” and “Restricted Access” on DSpace Test because we’re not pushing forward with the new status terms for now
  • Purge remaining Oracle Java 8 stuff from CGSpace (linode18) since we migrated to OpenJDK a few months ago:

# dpkg -P oracle-java8-installer oracle-java8-set-default

  • Update usage rights on CGSpace as we agreed with Maria Garruccio and Peter last month:

    $ ./fix-metadata-values.py -i /tmp/2018-11-27-update-rights.csv -f dc.rights -t correct -m 53 -db dspace -u dspace -p 'fuu' -d
     Connected to database.
     Fixed 466 occurences of: Copyrighted; Any re-use allowed
  • Upgrade PostgreSQL on CGSpace (linode18) from 9.5 to 9.6:

    # apt install postgresql-9.6 postgresql-client-9.6 postgresql-contrib-9.6 postgresql-server-dev-9.6
     # pg_ctlcluster 9.5 main stop
     # tar -cvzpf var-lib-postgresql-9.5.tar.gz /var/lib/postgresql/9.5
    @@ -600,72 +512,60 @@ Fixed 466 occurences of: Copyrighted; Any re-use allowed
     # pg_upgradecluster 9.5 main
     # pg_dropcluster 9.5 main
     # dpkg -l | grep postgresql | grep 9.5 | awk '{print $2}' | xargs dpkg -r
  • I’ve been running PostgreSQL 9.6 for months on my local development and public DSpace Test (linode19) environments
  • Run all system updates on CGSpace (linode18) and restart the server
  • Try to run the DSpace cleanup script on CGSpace (linode18), but I get some errors about foreign key constraints:

    $ dspace cleanup -v
 - Deleting bitstream information (ID: 158227)
 - Deleting bitstream record from database (ID: 158227)
Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
  Detail: Key (bitstream_id)=(158227) is still referenced from table "bundle".
     ...
  • As always, the solution is to delete those IDs manually in PostgreSQL:

    $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (158227, 158251);'
     UPDATE 1
    -
  • - -
  • After all that I started a full Discovery reindex to get the index name changes and rights updates

    2018-12-29

    - - +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "29/Dec/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    963 40.77.167.152
    987 35.237.175.180
   1062 40.77.167.55
   1464 66.249.66.223
   1660 34.218.226.147
   1801 70.32.83.92
   2005 50.116.102.77
   3218 66.249.66.219
   4608 205.186.128.185
   5585 54.70.40.11

    # zcat --force /var/log/nginx/*.log.1 /var/log/nginx/*.log.2.gz | grep -E "29/Dec/2018:1(6|7|8)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    115 66.249.66.223
    118 207.46.13.14
    123 34.218.226.147
    133 95.108.181.88
    137 35.237.175.180
    164 66.249.66.219
    260 157.55.39.59
    291 40.77.167.55
    312 207.46.13.129
   1253 54.70.40.11
    + diff --git a/docs/2019-01/index.html b/docs/2019-01/index.html index 3a50a3c79..cbabd0741 100644 --- a/docs/2019-01/index.html +++ b/docs/2019-01/index.html @@ -8,23 +8,20 @@ @@ -35,25 +32,22 @@ I don’t see anything interesting in the web server logs around that time t - + @@ -134,45 +128,38 @@ I don’t see anything interesting in the web server logs around that time t

    -

    2019-01-02

    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
     92 40.77.167.4
     99 210.7.29.100
    120 38.126.157.45
    177 35.237.175.180
    177 40.77.167.32
    216 66.249.75.219
    225 18.203.76.93
    261 46.101.86.248
    357 207.46.13.1
    903 54.70.40.11

    - -
    2019-01-03 14:45:21,727 INFO  org.dspace.browse.BrowseEngine @ anonymous:session_id=9471D72242DAA05BCC87734FE3C66EA6:ip_addr=127.0.0.1:browse_mini:
     2019-01-03 14:45:21,971 INFO  org.dspace.app.webui.discovery.DiscoverUtility @ facets for scope, null: 23
     2019-01-03 14:45:22,115 WARN  org.dspace.app.webui.servlet.InternalErrorServlet @ :session_id=9471D72242DAA05BCC87734FE3C66EA6:internal_error:-- URL Was: http://localhost:8080/jspui/internal-error
    @@ -215,107 +196,100 @@ $ sudo docker run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/d
     -- Parameters were:
     
     org.apache.jasper.JasperException: /home.jsp (line: [214], column: [1]) /discovery/static-tagcloud-facet.jsp (line: [57], column: [8]) No tag [tagcloud] defined in tag library imported with prefix [dspace]
    -at org.apache.jasper.compiler.DefaultErrorHandler.jspError(DefaultErrorHandler.java:41)
    -at org.apache.jasper.compiler.ErrorDispatcher.dispatch(ErrorDispatcher.java:291)
    -at org.apache.jasper.compiler.ErrorDispatcher.jspError(ErrorDispatcher.java:97)
    -at org.apache.jasper.compiler.Parser.processIncludeDirective(Parser.java:347)
    -at org.apache.jasper.compiler.Parser.parseIncludeDirective(Parser.java:380)
    -at org.apache.jasper.compiler.Parser.parseDirective(Parser.java:481)
    -at org.apache.jasper.compiler.Parser.parseElements(Parser.java:1445)
    -at org.apache.jasper.compiler.Parser.parseBody(Parser.java:1683)
    -at org.apache.jasper.compiler.Parser.parseOptionalBody(Parser.java:1016)
    -at org.apache.jasper.compiler.Parser.parseCustomTag(Parser.java:1291)
    -at org.apache.jasper.compiler.Parser.parseElements(Parser.java:1470)
    -at org.apache.jasper.compiler.Parser.parse(Parser.java:144)
    -at org.apache.jasper.compiler.ParserController.doParse(ParserController.java:244)
    -at org.apache.jasper.compiler.ParserController.parse(ParserController.java:105)
    -at org.apache.jasper.compiler.Compiler.generateJava(Compiler.java:202)
    -at org.apache.jasper.compiler.Compiler.compile(Compiler.java:373)
    -at org.apache.jasper.compiler.Compiler.compile(Compiler.java:350)
    -at org.apache.jasper.compiler.Compiler.compile(Compiler.java:334)
    -at org.apache.jasper.JspCompilationContext.compile(JspCompilationContext.java:595)
    -at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:399)
    -at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:386)
    -at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:330)
    -at javax.servlet.http.HttpServlet.service(HttpServlet.java:742)
    -at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231)
    -at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    -at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
    -at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    -at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    -at org.apache.catalina.core.ApplicationDispatcher.invoke(ApplicationDispatcher.java:728)
    -at org.apache.catalina.core.ApplicationDispatcher.processRequest(ApplicationDispatcher.java:470)
    -at org.apache.catalina.core.ApplicationDispatcher.doForward(ApplicationDispatcher.java:395)
    -at org.apache.catalina.core.ApplicationDispatcher.forward(ApplicationDispatcher.java:316)
    -at org.dspace.app.webui.util.JSPManager.showJSP(JSPManager.java:60)
    -at org.apache.jsp.index_jsp._jspService(index_jsp.java:191)
    -at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
    -at javax.servlet.http.HttpServlet.service(HttpServlet.java:742)
    -at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:476)
    -at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:386)
    -at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:330)
    -at javax.servlet.http.HttpServlet.service(HttpServlet.java:742)
    -at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231)
    -at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    -at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
    -at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    -at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    -at org.dspace.utils.servlet.DSpaceWebappServletFilter.doFilter(DSpaceWebappServletFilter.java:78)
    -at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    -at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    -at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:198)
    -at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:96)
    -at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:493)
    -at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:140)
    -at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:81)
    -at org.apache.catalina.valves.CrawlerSessionManagerValve.invoke(CrawlerSessionManagerValve.java:234)
    -at org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:650)
    -at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:87)
    -at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:342)
    -at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:800)
    -at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66)
    -at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:806)
    -at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1498)
    -at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
    -at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    -at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    -at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    -at java.lang.Thread.run(Thread.java:748)
    -
    - -
  • I notice that I get different JSESSIONID cookies for / (XMLUI) and /jspui (JSPUI) on Tomcat 8.5.37, I wonder if it’s the same on Tomcat 7.0.92… yes I do.

  • - -
  • Hmm, on Tomcat 7.0.92 I see that I get a dspace.current.user.id session cookie after logging into XMLUI, and then when I browse to JSPUI I am still logged in…

    - + at org.apache.jasper.compiler.DefaultErrorHandler.jspError(DefaultErrorHandler.java:41) + at org.apache.jasper.compiler.ErrorDispatcher.dispatch(ErrorDispatcher.java:291) + at org.apache.jasper.compiler.ErrorDispatcher.jspError(ErrorDispatcher.java:97) + at org.apache.jasper.compiler.Parser.processIncludeDirective(Parser.java:347) + at org.apache.jasper.compiler.Parser.parseIncludeDirective(Parser.java:380) + at org.apache.jasper.compiler.Parser.parseDirective(Parser.java:481) + at org.apache.jasper.compiler.Parser.parseElements(Parser.java:1445) + at org.apache.jasper.compiler.Parser.parseBody(Parser.java:1683) + at org.apache.jasper.compiler.Parser.parseOptionalBody(Parser.java:1016) + at org.apache.jasper.compiler.Parser.parseCustomTag(Parser.java:1291) + at org.apache.jasper.compiler.Parser.parseElements(Parser.java:1470) + at org.apache.jasper.compiler.Parser.parse(Parser.java:144) + at org.apache.jasper.compiler.ParserController.doParse(ParserController.java:244) + at org.apache.jasper.compiler.ParserController.parse(ParserController.java:105) + at org.apache.jasper.compiler.Compiler.generateJava(Compiler.java:202) + at org.apache.jasper.compiler.Compiler.compile(Compiler.java:373) + at org.apache.jasper.compiler.Compiler.compile(Compiler.java:350) + at org.apache.jasper.compiler.Compiler.compile(Compiler.java:334) + at org.apache.jasper.JspCompilationContext.compile(JspCompilationContext.java:595) + at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:399) + at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:386) + at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:330) + at javax.servlet.http.HttpServlet.service(HttpServlet.java:742) + at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231) + at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) + at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52) + at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) + at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) + at org.apache.catalina.core.ApplicationDispatcher.invoke(ApplicationDispatcher.java:728) + at org.apache.catalina.core.ApplicationDispatcher.processRequest(ApplicationDispatcher.java:470) + at org.apache.catalina.core.ApplicationDispatcher.doForward(ApplicationDispatcher.java:395) + at org.apache.catalina.core.ApplicationDispatcher.forward(ApplicationDispatcher.java:316) + at org.dspace.app.webui.util.JSPManager.showJSP(JSPManager.java:60) + at org.apache.jsp.index_jsp._jspService(index_jsp.java:191) + at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70) + at javax.servlet.http.HttpServlet.service(HttpServlet.java:742) + at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:476) + at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:386) + at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:330) + at javax.servlet.http.HttpServlet.service(HttpServlet.java:742) + at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231) + at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) + at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52) + at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) + at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) + at org.dspace.utils.servlet.DSpaceWebappServletFilter.doFilter(DSpaceWebappServletFilter.java:78) + at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) + at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) + at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:198) + at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:96) + at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:493) + at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:140) + at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:81) + at org.apache.catalina.valves.CrawlerSessionManagerValve.invoke(CrawlerSessionManagerValve.java:234) + at org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:650) + at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:87) + at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:342) + at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:800) + at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66) + at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:806) + at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1498) + at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49) + at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) + at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) + at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) + at java.lang.Thread.run(Thread.java:748) + - -

  • I sent a message to the dspace-tech mailing list to ask

    2019-01-04

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Jan/2019:1(7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
        189 207.46.13.192
        217 31.6.77.23
        340 66.249.70.29
        349 40.77.167.86
        417 34.218.226.147
        630 207.46.13.173
        710 35.237.175.180
        790 40.77.167.87
       1776 66.249.70.27
       2099 54.70.40.11

  • I’m thinking about trying to validate our dc.subject terms against AGROVOC webservices
  • There seem to be a few APIs and the documentation is kinda confusing, but I found this REST endpoint that does work well, for example searching for SOIL:
    $ http http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=SOIL&lang=en
     HTTP/1.1 200 OK
     Access-Control-Allow-Origin: *
    @@ -331,40 +305,38 @@ X-Content-Type-Options: nosniff
     X-Frame-Options: ALLOW-FROM http://aims.fao.org
     
     {
    -"@context": {
    -    "@language": "en",
    -    "altLabel": "skos:altLabel",
    -    "hiddenLabel": "skos:hiddenLabel",
    -    "isothes": "http://purl.org/iso25964/skos-thes#",
    -    "onki": "http://schema.onki.fi/onki#",
    -    "prefLabel": "skos:prefLabel",
    -    "results": {
    -        "@container": "@list",
    -        "@id": "onki:results"
    +    "@context": {
    +        "@language": "en",
    +        "altLabel": "skos:altLabel",
    +        "hiddenLabel": "skos:hiddenLabel",
    +        "isothes": "http://purl.org/iso25964/skos-thes#",
    +        "onki": "http://schema.onki.fi/onki#",
    +        "prefLabel": "skos:prefLabel",
    +        "results": {
    +            "@container": "@list",
    +            "@id": "onki:results"
    +        },
    +        "skos": "http://www.w3.org/2004/02/skos/core#",
    +        "type": "@type",
    +        "uri": "@id"
         },
    -    "skos": "http://www.w3.org/2004/02/skos/core#",
    -    "type": "@type",
    -    "uri": "@id"
    -},
    -"results": [
    -    {
    -        "lang": "en",
    -        "prefLabel": "soil",
    -        "type": [
    -            "skos:Concept"
    -        ],
    -        "uri": "http://aims.fao.org/aos/agrovoc/c_7156",
    -        "vocab": "agrovoc"
    -    }
    -],
    -"uri": ""
    +    "results": [
    +        {
    +            "lang": "en",
    +            "prefLabel": "soil",
    +            "type": [
    +                "skos:Concept"
    +            ],
    +            "uri": "http://aims.fao.org/aos/agrovoc/c_7156",
    +            "vocab": "agrovoc"
    +        }
    +    ],
    +    "uri": ""
     }
    -
  • - -
  • The API does not appear to be case sensitive (searches for SOIL and soil return the same thing)

  • - -
  • I’m a bit confused that there’s no obvious return code or status when a term is not found, for example SOILS:

    - +
    HTTP/1.1 200 OK
     Access-Control-Allow-Origin: *
     Connection: Keep-Alive
    @@ -379,30 +351,28 @@ X-Content-Type-Options: nosniff
     X-Frame-Options: ALLOW-FROM http://aims.fao.org
     
     {
    -"@context": {
    -    "@language": "en",
    -    "altLabel": "skos:altLabel",
    -    "hiddenLabel": "skos:hiddenLabel",
    -    "isothes": "http://purl.org/iso25964/skos-thes#",
    -    "onki": "http://schema.onki.fi/onki#",
    -    "prefLabel": "skos:prefLabel",
    -    "results": {
    -        "@container": "@list",
    -        "@id": "onki:results"
    +    "@context": {
    +        "@language": "en",
    +        "altLabel": "skos:altLabel",
    +        "hiddenLabel": "skos:hiddenLabel",
    +        "isothes": "http://purl.org/iso25964/skos-thes#",
    +        "onki": "http://schema.onki.fi/onki#",
    +        "prefLabel": "skos:prefLabel",
    +        "results": {
    +            "@container": "@list",
    +            "@id": "onki:results"
    +        },
    +        "skos": "http://www.w3.org/2004/02/skos/core#",
    +        "type": "@type",
    +        "uri": "@id"
         },
    -    "skos": "http://www.w3.org/2004/02/skos/core#",
    -    "type": "@type",
    -    "uri": "@id"
    -},
    -"results": [],
    -"uri": ""
    +    "results": [],
    +    "uri": ""
     }
  • I guess the results object will just be empty…
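  • A minimal sketch of how that empty-results check could be scripted for our dc.subject terms, assuming Python 3 with the requests library (the sample terms are just examples):

    #!/usr/bin/env python3
    # Check a few subject terms against the AGROVOC REST endpoint used above;
    # a term counts as "not found" when the "results" list comes back empty.
    import requests

    terms = ['SOIL', 'SOILS', 'FISHERIES']  # hypothetical sample terms

    for term in terms:
        response = requests.get(
            'http://agrovoc.uniroma2.it/agrovoc/rest/v1/search',
            params={'query': term, 'lang': 'en'},
        )
        results = response.json().get('results', [])
        print(f"{term}: {'MATCH' if results else 'NO MATCH'}")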

  • Another way would be to try with SPARQL, perhaps using the Python 2.7 sparql-client:

    - +
    $ python2.7 -m virtualenv /tmp/sparql
     $ . /tmp/sparql/bin/activate
     $ pip install sparql-client ipython
    @@ -410,16 +380,16 @@ $ ipython
     In [10]: import sparql
     In [11]: s = sparql.Service("http://agrovoc.uniroma2.it:3030/agrovoc/sparql", "utf-8", "GET")
     In [12]: statement=('PREFIX skos: <http://www.w3.org/2004/02/skos/core#> '
    -...: 'SELECT '
    -...: '?label '
    -...: 'WHERE { '
    -...: '{  ?concept  skos:altLabel ?label . } UNION {  ?concept  skos:prefLabel ?label . } '
    -...: 'FILTER regex(str(?label), "^fish", "i") . '
    -...: '} LIMIT 10')
    +    ...: 'SELECT '
    +    ...: '?label '
    +    ...: 'WHERE { '
    +    ...: '{  ?concept  skos:altLabel ?label . } UNION {  ?concept  skos:prefLabel ?label . } '
    +    ...: 'FILTER regex(str(?label), "^fish", "i") . '
    +    ...: '} LIMIT 10')
     In [13]: result = s.query(statement)
     In [14]: for row in result.fetchone():
    -...:     print(row)
    -...:
    +   ...:     print(row)
    +   ...:
     (<Literal "fish catching"@en>,)
     (<Literal "fish harvesting"@en>,)
     (<Literal "fish meat"@en>,)
    @@ -430,362 +400,337 @@ In [14]: for row in result.fetchone():
     (<Literal "fishflies"@en>,)
     (<Literal "fishery biology"@en>,)
     (<Literal "fish production"@en>,)
    -
  • - -
  • The SPARQL query comes from my notes in 2017-08

  • + - -

    2019-01-06

    2019-01-07

    2019-01-08

    2019-01-11

    2019-01-14

    2019-01-15
  • I am testing the speed of the WorldFish DSpace repository's REST API and it's five to ten times faster than CGSpace as I tested in 2018-10:
  • +
    $ time http --print h 'https://digitalarchive.worldfishcenter.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0'
     
     0.16s user 0.03s system 3% cpu 5.185 total
     0.17s user 0.02s system 2% cpu 7.123 total
     0.18s user 0.02s system 6% cpu 3.047 total
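  • A quick way to repeat that timing test, assuming Python 3 with the requests library (a rough sketch, not what I actually ran):

    #!/usr/bin/env python3
    # Time a few requests to the WorldFish REST API and print the elapsed
    # wall-clock time for each one.
    import time
    import requests

    url = ('https://digitalarchive.worldfishcenter.org/rest/items'
           '?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0')

    for _ in range(3):
        start = time.monotonic()
        requests.get(url)
        print(f'{time.monotonic() - start:.3f} seconds')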
  • In other news, Linode sent a mail last night that the CPU load on CGSpace (linode18) was high, here are the top IPs in the logs around those few hours:
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "14/Jan/2019:(17|18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
        157 31.6.77.23
        192 54.70.40.11
        202 66.249.64.157
        207 40.77.167.204
        220 157.55.39.140
        326 197.156.105.116
        385 207.46.13.158
       1211 35.237.175.180
       1830 66.249.64.155
       2482 45.5.186.2

    2019-01-16

  • Notes from our CG Core 2.0 metadata discussion:
  • Move dc.contributor.author to dc.creator
  • dc.contributor Project
  • dc.contributor Project Lead Center
  • dc.contributor Partner
  • dc.contributor Donor
  • dc.date
  • dc.language
  • dc.identifier
  • dc.identifier bibliographicCitation
  • dc.description.notes
  • dc.relation
  • dc.relation.isPartOf
  • dc.audience
  • Something happened to the Solr usage statistics on CGSpace
  • I looked on the server and the Solr cores are there (56GB!), and I don't see any obvious errors in dmesg or anything
  • I see that the server hasn't been rebooted in 26 days so I rebooted it
  • After reboot the Solr stats are still messed up in the Atmire Usage Stats module, it only shows 2019-01!

    Solr stats fucked up

    statistics-2018: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher

  • Looking in the Solr log I see this:
    2019-01-16 13:37:55,395 ERROR org.apache.solr.core.CoreContainer @ Error creating core [statistics-2018]: Error opening new searcher
     org.apache.solr.common.SolrException: Error opening new searcher
    -at org.apache.solr.core.SolrCore.<init>(SolrCore.java:873)
    -at org.apache.solr.core.SolrCore.<init>(SolrCore.java:646)
    -at org.apache.solr.core.CoreContainer.create(CoreContainer.java:491)
    -at org.apache.solr.core.CoreContainer.create(CoreContainer.java:466)
    -at org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:575)
    -at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestInternal(CoreAdminHandler.java:199)
    -at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:188)
    -at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    -at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:729)
    -at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:258)
    -at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
    -at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
    -at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
    -at org.dspace.solr.filters.LocalHostRestrictionFilter.doFilter(LocalHostRestrictionFilter.java:50)
    -at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
    -at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
    -at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:221)
    -at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
    -at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:505)
    -at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:169)
    -at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
    -at org.apache.catalina.valves.CrawlerSessionManagerValve.invoke(CrawlerSessionManagerValve.java:180)
    -at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:956)
    -at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
    -at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:436)
    -at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1078)
    -at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:625)
    -at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:316)
    -at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    -at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    -at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    -at java.lang.Thread.run(Thread.java:748)
    +    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:873)
    +    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:646)
    +    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:491)
    +    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:466)
    +    at org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:575)
    +    at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestInternal(CoreAdminHandler.java:199)
    +    at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:188)
    +    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    +    at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:729)
    +    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:258)
    +    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
    +    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
    +    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
    +    at org.dspace.solr.filters.LocalHostRestrictionFilter.doFilter(LocalHostRestrictionFilter.java:50)
    +    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
    +    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
    +    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:221)
    +    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
    +    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:505)
    +    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:169)
    +    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
    +    at org.apache.catalina.valves.CrawlerSessionManagerValve.invoke(CrawlerSessionManagerValve.java:180)
    +    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:956)
    +    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
    +    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:436)
    +    at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1078)
    +    at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:625)
    +    at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:316)
    +    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    +    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    +    at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    +    at java.lang.Thread.run(Thread.java:748)
     Caused by: org.apache.solr.common.SolrException: Error opening new searcher
    -at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1565)
    -at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1677)
    -at org.apache.solr.core.SolrCore.<init>(SolrCore.java:845)
    -... 31 more
    +    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1565)
    +    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1677)
    +    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:845)
    +    ... 31 more
     Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
    -at org.apache.lucene.store.Lock.obtain(Lock.java:89)
    -at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:753)
    -at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:77)
    -at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:64)
    -at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:279)
    -at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:111)
    -at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1528)
    -... 33 more
    +    at org.apache.lucene.store.Lock.obtain(Lock.java:89)
    +    at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:753)
    +    at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:77)
    +    at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:64)
    +    at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:279)
    +    at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:111)
    +    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1528)
    +    ... 33 more
     2019-01-16 13:37:55,401 ERROR org.apache.solr.core.SolrCore @ org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2018': Unable to create core [statistics-2018] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
    -at org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:613)
    -at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestInternal(CoreAdminHandler.java:199)
    -at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:188)
    -at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    -at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:729)
    -at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:258)
    -at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
    -at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
    -at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
    -at org.dspace.solr.filters.LocalHostRestrictionFilter.doFilter(LocalHostRestrictionFilter.java:50)
    -at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
    -at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
    -at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:221)
    -at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
    -at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:505)
    -at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:169)
    -at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
    -at org.apache.catalina.valves.CrawlerSessionManagerValve.invoke(CrawlerSessionManagerValve.java:180)
    -at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:956)
    -at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
    -at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:436)
    -at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1078)
    -at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:625)
    -at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:316)
    -at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    -at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    -at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    -at java.lang.Thread.run(Thread.java:748)
    +    at org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:613)
    +    at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestInternal(CoreAdminHandler.java:199)
    +    at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:188)
    +    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    +    at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:729)
    +    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:258)
    +    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
    +    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
    +    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
    +    at org.dspace.solr.filters.LocalHostRestrictionFilter.doFilter(LocalHostRestrictionFilter.java:50)
    +    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
    +    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
    +    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:221)
    +    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
    +    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:505)
    +    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:169)
    +    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
    +    at org.apache.catalina.valves.CrawlerSessionManagerValve.invoke(CrawlerSessionManagerValve.java:180)
    +    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:956)
    +    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
    +    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:436)
    +    at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1078)
    +    at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:625)
    +    at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:316)
    +    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    +    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    +    at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    +    at java.lang.Thread.run(Thread.java:748)
     Caused by: org.apache.solr.common.SolrException: Unable to create core [statistics-2018]
    -at org.apache.solr.core.CoreContainer.create(CoreContainer.java:507)
    -at org.apache.solr.core.CoreContainer.create(CoreContainer.java:466)
    -at org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:575)
    -... 27 more
    +    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:507)
    +    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:466)
    +    at org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:575)
    +    ... 27 more
     Caused by: org.apache.solr.common.SolrException: Error opening new searcher
    -at org.apache.solr.core.SolrCore.<init>(SolrCore.java:873)
    -at org.apache.solr.core.SolrCore.<init>(SolrCore.java:646)
    -at org.apache.solr.core.CoreContainer.create(CoreContainer.java:491)
    -... 29 more
    +    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:873)
    +    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:646)
    +    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:491)
    +    ... 29 more
     Caused by: org.apache.solr.common.SolrException: Error opening new searcher
    -at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1565)
    -at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1677)
    -at org.apache.solr.core.SolrCore.<init>(SolrCore.java:845)
    -... 31 more
    +    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1565)
    +    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1677)
    +    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:845)
    +    ... 31 more
     Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
    -at org.apache.lucene.store.Lock.obtain(Lock.java:89)
    -at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:753)
    -at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:77)
    -at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:64)
    -at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:279)
    -at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:111)
    -at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1528)
    -... 33 more
    -
  • I found some threads on StackOverflow etc discussing this and several suggested increasing the address space for the shell with ulimit
  • I added ulimit -v unlimited to /etc/default/tomcat7 and restarted Tomcat and now Solr is working again:

    Solr stats working

  • Abenet was asking if the Atmire Usage Stats are correct because they are over 2 million the last few months…
  • For 2019-01 alone the Usage Stats are already around 1.2 million
  • I tried to look in the nginx logs to see how many raw requests there are so far this month and it's about 1.4 million:
    # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
     1442874
     
     real    0m17.161s
     user    0m16.205s
     sys     0m2.396s
    -
    - - -

    2019-01-17

    - +

    2019-01-17

    +
  • And what is the relationship between DC and DCTERMS?
  • DSpace uses DCTERMS in the metadata it embeds in XMLUI item views!
  • We really need to look at this more carefully and see the impacts that might be made from switching core fields like languages, abstract, authors, etc
  • @@ -793,14 +738,13 @@ sys 0m2.396s
  • I think I understand the difference between DC and DCTERMS finally: DC is the original set of fifteen elements and DCTERMS is the newer version that was supposed to address much of the drawbacks of the original with regards to digital content
  • We might be able to use some proper fields for citation, abstract, etc that are part of DCTERMS
  • To make matters more confusing, there is also “qualified Dublin Core” that uses the original fifteen elements of legacy DC and qualifies them, like dc.date.accessioned -
  • + +
  • So we should be trying to use DCTERMS where possible, unless it is some internal thing that might mess up DSpace (like dates)
  • “Elements 1.1” means legacy DC
  • Possible action list for CGSpace: -
  • - -

    2019-01-19

    - + + +

    2019-01-19

    - +
  • There's no official set of Dublin Core qualifiers so I can't tell if things like dc.contributor.author that are used by DSpace are official
  • I found a great presentation from 2015 by the Digital Repository of Ireland that discusses using MARC Relator Terms with Dublin Core elements
  • It seems that dc.contributor.author would be a supported term according to this Library of Congress list linked from the Dublin Core website
  • The Library of Congress document specifically says:

    These terms conform with the DCMI Abstract Model and may be used in DCMI application profiles. DCMI endorses their use with Dublin Core elements as indicated.

    2019-01-20

    # w
     04:46:14 up 213 days,  7:25,  4 users,  load average: 1.94, 1.50, 1.35

    2019-01-21
    [Unit]
     Description=Apache Tomcat 7 Web Application Container
     After=network.target
    @@ -869,10 +806,9 @@ User=aorth
     Group=aorth
     [Install]
     WantedBy=multi-user.target
  • Or try to adapt a real systemd service like Arch Linux’s:
    [Unit]
     Description=Tomcat 7 servlet container
     After=network.target
    @@ -889,58 +825,49 @@ Environment=ERRFILE=SYSLOG
     Environment=OUTFILE=SYSLOG
     
     ExecStart=/usr/bin/jsvc \
    -        -Dcatalina.home=${CATALINA_HOME} \
    -        -Dcatalina.base=${CATALINA_BASE} \
    -        -Djava.io.tmpdir=/var/tmp/tomcat7/temp \
    -        -cp /usr/share/java/commons-daemon.jar:/usr/share/java/eclipse-ecj.jar:${CATALINA_HOME}/bin/bootstrap.jar:${CATALINA_HOME}/bin/tomcat-juli.jar \
    -        -user tomcat7 \
    -        -java-home ${TOMCAT_JAVA_HOME} \
    -        -pidfile /var/run/tomcat7.pid \
    -        -errfile ${ERRFILE} \
    -        -outfile ${OUTFILE} \
    -        $CATALINA_OPTS \
    -        org.apache.catalina.startup.Bootstrap
    +            -Dcatalina.home=${CATALINA_HOME} \
    +            -Dcatalina.base=${CATALINA_BASE} \
    +            -Djava.io.tmpdir=/var/tmp/tomcat7/temp \
    +            -cp /usr/share/java/commons-daemon.jar:/usr/share/java/eclipse-ecj.jar:${CATALINA_HOME}/bin/bootstrap.jar:${CATALINA_HOME}/bin/tomcat-juli.jar \
    +            -user tomcat7 \
    +            -java-home ${TOMCAT_JAVA_HOME} \
    +            -pidfile /var/run/tomcat7.pid \
    +            -errfile ${ERRFILE} \
    +            -outfile ${OUTFILE} \
    +            $CATALINA_OPTS \
    +            org.apache.catalina.startup.Bootstrap
     
     ExecStop=/usr/bin/jsvc \
    -        -pidfile /var/run/tomcat7.pid \
    -        -stop \
    -        org.apache.catalina.startup.Bootstrap
    +            -pidfile /var/run/tomcat7.pid \
    +            -stop \
    +            org.apache.catalina.startup.Bootstrap
     
     [Install]
     WantedBy=multi-user.target
    -
  • I see that jsvc and libcommons-daemon-java are both available on Ubuntu so that should be easy to port
  • We probably don't need Eclipse Java Bytecode Compiler (ecj)
  • I tested Tomcat 7.0.92 on Arch Linux using the tomcat7.service with jsvc and it works… nice!
  • I think I might manage this the same way I do the restic releases in the Ansible infrastructure scripts, where I download a specific version and symlink to some generic location without the version number
  • I verified that there is indeed an issue with sharded Solr statistics cores on DSpace, which will cause inaccurate results in the dspace-statistics-api:
    $ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:2+id:11576&fq=isBot:false&fq=statistics_type:view' | grep numFound
     <result name="response" numFound="33" start="0">
     $ http 'http://localhost:3000/solr/statistics-2018/select?indent=on&rows=0&q=type:2+id:11576&fq=isBot:false&fq=statistics_type:view' | grep numFound
     <result name="response" numFound="241" start="0">
    -
  • - -
  • I opened an issue on the GitHub issue tracker (#10)

  • - -
  • I don’t think the SolrClient library we are currently using supports these types of queries, so we might have to just do raw queries with requests

  • - -
  • The pysolr library says it supports multicore indexes, but I am not sure it does (or at least not with our setup):

    - +
    import pysolr
     solr = pysolr.Solr('http://localhost:3000/solr/statistics')
     results = solr.search('type:2', **{'fq': 'isBot:false AND statistics_type:view', 'facet': 'true', 'facet.field': 'id', 'facet.mincount': 1, 'facet.limit': 10, 'facet.offset': 0, 'rows': 0})
     print(results.facets['facet_fields'])
     {'id': ['77572', 646, '93185', 380, '92932', 375, '102499', 372, '101430', 337, '77632', 331, '102449', 289, '102485', 276, '100849', 270, '47080', 260]}
    -
  • - -
  • If I double check one item from above, for example 77572, it appears this is only working on the current statistics core and not the shards:

    - +
    import pysolr
     solr = pysolr.Solr('http://localhost:3000/solr/statistics')
     results = solr.search('type:2 id:77572', **{'fq': 'isBot:false AND statistics_type:view'})
    @@ -950,19 +877,15 @@ solr = pysolr.Solr('http://localhost:3000/solr/statistics-2018/')
     results = solr.search('type:2 id:77572', **{'fq': 'isBot:false AND statistics_type:view'})
     print(results.hits)
     595
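  • One thing worth trying (an untested sketch, not something I ran): pysolr passes unrecognized keyword parameters straight through to Solr, so adding a shards parameter to the same query might be enough to make it distributed:

    import pysolr

    # Same query as above, but with an explicit "shards" parameter so Solr
    # fans the search out to the yearly statistics core as well.
    solr = pysolr.Solr('http://localhost:3000/solr/statistics')
    results = solr.search('type:2 id:77572', **{
        'fq': 'isBot:false AND statistics_type:view',
        'shards': 'localhost:8081/solr/statistics,localhost:8081/solr/statistics-2018',
    })
    print(results.hits)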
    -
  • - -
  • So I guess I need to figure out how to use join queries and maybe even switch to using raw Python requests with JSON

  • - -
  • This enumerates the list of Solr cores and returns JSON format:

    - +
    http://localhost:3000/solr/admin/cores?action=STATUS&wt=json
    -
  • - -
  • I think I figured out how to search across shards, I needed to give the whole URL to each other core

  • - -
  • Now I get more results when I start adding the other statistics cores:

    - +
    $ http 'http://localhost:3000/solr/statistics/select?&indent=on&rows=0&q=*:*' | grep numFound<result name="response" numFound="2061320" start="0">
     $ http 'http://localhost:3000/solr/statistics/select?&shards=localhost:8081/solr/statistics-2018&indent=on&rows=0&q=*:*' | grep numFound
     <result name="response" numFound="16280292" start="0" maxScore="1.0">
    @@ -970,415 +893,349 @@ $ http 'http://localhost:3000/solr/statistics/select?&shards=localhost:8081/
     <result name="response" numFound="25606142" start="0" maxScore="1.0">
     $ http 'http://localhost:3000/solr/statistics/select?&shards=localhost:8081/solr/statistics-2018,localhost:8081/solr/statistics-2017,localhost:8081/solr/statistics-2016&indent=on&rows=0&q=*:*' | grep numFound
     <result name="response" numFound="31532212" start="0" maxScore="1.0">
  • I should be able to modify the dspace-statistics-api to check the shards via the Solr core status, then add the shards parameter to each query to make the search distributed among the cores
  • I implemented a proof of concept to query the Solr STATUS for active cores and to add them with a shards query string (see the sketch at the end of this section)
  • A few things I noticed:
    $ http 'http://localhost:8081/solr/statistics/select?indent=on&rows=0&q=type:2+id:11576&fq=isBot:false&fq=statistics_type:view&shards=localhost:8081/solr/statistics,localhost:8081/solr/statistics-2018' | grep numFound
     <result name="response" numFound="275" start="0" maxScore="12.205825">
     $ http 'http://localhost:8081/solr/statistics/select?indent=on&rows=0&q=type:2+id:11576&fq=isBot:false&fq=statistics_type:view&shards=localhost:8081/solr/statistics-2018' | grep numFound
     <result name="response" numFound="241" start="0" maxScore="12.205825">
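  • A rough sketch of the idea behind that proof of concept, assuming Python 3 with the requests library and the same local Solr as in the examples above (this is not the actual dspace-statistics-api code):

    #!/usr/bin/env python3
    # Enumerate the active Solr cores via the STATUS action, build a "shards"
    # parameter from the statistics cores, and run one distributed query.
    import requests

    solr_url = 'http://localhost:8081/solr'

    status = requests.get(f'{solr_url}/admin/cores',
                          params={'action': 'STATUS', 'wt': 'json'}).json()
    cores = [name for name in status['status'] if name.startswith('statistics')]
    shards = ','.join(f'localhost:8081/solr/{core}' for core in cores)

    params = {
        'q': 'type:2 id:11576',
        'fq': ['isBot:false', 'statistics_type:view'],
        'shards': shards,
        'rows': 0,
        'wt': 'json',
    }
    response = requests.get(f'{solr_url}/statistics/select', params=params).json()
    print(response['response']['numFound'])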

    2019-01-22
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "22/Jan/2019:1(4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +    155 40.77.167.106
    +    176 2003:d5:fbda:1c00:1106:c7a0:4b17:3af8
    +    189 107.21.16.70
    +    217 54.83.93.85
    +    310 46.174.208.142
    +    346 83.103.94.48
    +    360 45.5.186.2
    +    595 154.113.73.30
    +    716 196.191.127.37
    +    915 35.237.175.180
    +
    +
    Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/7.0.185.1002 Safari/537.36
    +
    +
    Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36
    +

    2019-01-23

    -

    #ILRI research: Towards unlocking the potential of the hides and skins value chain in Somaliland https://t.co/EZH7ALW4dp

    — ILRI Communications (@ILRI) January 18, 2019
    - - -

    Dynamic Link not found

    - +

    Dynamic Link not found

    + +
  • Create accounts for Bosun from IITA and Valerio from ICARDA / CGMEL on DSpace Test
  • Maria Garruccio asked me for a list of author affiliations from all of their submitted items so she can clean them up
  • I got a list of their collections from the CGSpace XMLUI and then used an SQL query to dump the unique values to CSV:
    dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/35501', '10568/41728', '10568/49622', '10568/56589', '10568/56592', '10568/65064', '10568/65718', '10568/65719', '10568/67373', '10568/67731', '10568/68235', '10568/68546', '10568/69089', '10568/69160', '10568/69419', '10568/69556', '10568/70131', '10568/70252', '10568/70978'))) group by text_value order by count desc) to /tmp/bioversity-affiliations.csv with csv;
     COPY 1109
    -
    - -
  • Send a mail to the dspace-tech mailing list about the OpenSearch issue we had with the Livestock CRP

  • - -
  • Linode sent an alert that CGSpace (linode18) had a high load this morning, here are the top ten IPs during that time:

    - +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "23/Jan/2019:0(4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -222 54.226.25.74
    -241 40.77.167.13
    -272 46.101.86.248
    -297 35.237.175.180
    -332 45.5.184.72
    -355 34.218.226.147
    -404 66.249.64.155
    -4637 205.186.128.185
    -4637 70.32.83.92
    -9265 45.5.186.2
    -
  • I think it's the usual IPs:
  • Following up on the thumbnail issue that we had in 2018-12
  • It looks like the two items with problematic PDFs both have thumbnails now:
  • 10568/98390
  • 10568/98391
  • Just to make sure these were not uploaded by the user or something, I manually forced the regeneration of these with DSpace's filter-media:
    $ schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace filter-media -v -f -i 10568/98390
     $ schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace filter-media -v -f -i 10568/98391
    -
    - -
  • Both of these were successful, so there must have been an update to ImageMagick or Ghostscript in Ubuntu since early 2018-12

  • - -
  • Looking at the apt history logs I see that on 2018-12-07 a security update for Ghostscript was installed (version 9.26~dfsg+0-0ubuntu0.16.04.3)

  • - -
  • I think this Launchpad discussion is relevant: https://bugs.launchpad.net/ubuntu/+source/ghostscript/+bug/1806517

  • - -
  • As well as the original Ghostscript bug report: https://bugs.ghostscript.com/show_bug.cgi?id=699815

  • + - -

    2019-01-24

    - +

    2019-01-24

    $ identify Food\ safety\ Kenya\ fruits.pdf\[0\]
     zsh: abort (core dumped)  identify Food\ safety\ Kenya\ fruits.pdf\[0\]
     $ identify Food\ safety\ Kenya\ fruits.pdf\[0\]
     Food safety Kenya fruits.pdf[0]=>Food safety Kenya fruits.pdf PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.000u 0:00.000
     identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/1747.
    -
    - -
  • I reported it to the Arch Linux bug tracker (61513)

  • - -
  • I told Atmire to go ahead with the Metadata Quality Module addition based on our 5_x-dev branch (657)

  • - -
  • Linode sent alerts last night to say that CGSpace (linode18) was using high CPU last night, here are the top ten IPs from the nginx logs around that time:

    - +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "23/Jan/2019:(18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -305 3.81.136.184
    -306 3.83.14.11
    -306 52.54.252.47
    -325 54.221.57.180
    -378 66.249.64.157
    -424 54.70.40.11
    -497 47.29.247.74
    -783 35.237.175.180
    -1108 66.249.64.155
    -2378 45.5.186.2
    -
  • - -
  • 45.5.186.2 is CIAT and 66.249.64.155 is Google… hmmm.

  • - -
  • Linode sent another alert this morning, here are the top ten IPs active during that time:

    - + 305 3.81.136.184 + 306 3.83.14.11 + 306 52.54.252.47 + 325 54.221.57.180 + 378 66.249.64.157 + 424 54.70.40.11 + 497 47.29.247.74 + 783 35.237.175.180 + 1108 66.249.64.155 + 2378 45.5.186.2 +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "24/Jan/2019:0(4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -360 3.89.134.93
    -362 34.230.15.139
    -366 100.24.48.177
    -369 18.212.208.240
    -377 3.81.136.184
    -404 54.221.57.180
    -506 66.249.64.155
    -4642 70.32.83.92
    -4643 205.186.128.185
    -8593 45.5.186.2
    -
  • - -
  • Just double checking what CIAT is doing, they are mainly hitting the REST API:

    - + 360 3.89.134.93 + 362 34.230.15.139 + 366 100.24.48.177 + 369 18.212.208.240 + 377 3.81.136.184 + 404 54.221.57.180 + 506 66.249.64.155 + 4642 70.32.83.92 + 4643 205.186.128.185 + 8593 45.5.186.2 +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "24/Jan/2019:" | grep 45.5.186.2 | grep -Eo "GET /(handle|bitstream|rest|oai)/" | sort | uniq -c | sort -n
    -
  • - -
  • CIAT’s community currently has 12,000 items in it so this is normal

  • - -
  • The issue with goo.gl links that we saw yesterday appears to be resolved, as links are working again…

  • - -
  • For example: https://goo.gl/fb/VRj9Gq

  • - -
  • The full list of MARC Relators on the Library of Congress website linked from the DCMI relators page is very confusing

  • - -
  • Looking at the default DSpace XMLUI crosswalk in xhtml-head-item.properties I see a very complete mapping of DSpace DC and QDC fields to DCTERMS

  • I sent a message titled “DC, QDC, and DCTERMS: reviewing our metadata practices” to the dspace-tech mailing list to ask about some of this

    2019-01-25

    2019-01-27

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "27/Jan/2019:0(6|7|8)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -189 40.77.167.108
    -191 157.55.39.2
    -263 34.218.226.147
    -283 45.5.184.2
    -332 45.5.184.72
    -608 5.9.6.51
    -679 66.249.66.223
    -1116 66.249.66.219
    -4644 205.186.128.185
    -4644 70.32.83.92
    -
    - -
  • I think it’s the usual IPs:

    - + 189 40.77.167.108 + 191 157.55.39.2 + 263 34.218.226.147 + 283 45.5.184.2 + 332 45.5.184.72 + 608 5.9.6.51 + 679 66.249.66.223 + 1116 66.249.66.219 + 4644 205.186.128.185 + 4644 70.32.83.92 + - -

    2019-01-28

    - +
  • + +

    2019-01-28

    + +
  • Linode alerted that CGSpace (linode18) was using too much CPU again this morning, here are the active IPs from the web server log at the time:
  • +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "28/Jan/2019:0(6|7|8)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    - 67 207.46.13.50
    -105 41.204.190.40
    -117 34.218.226.147
    -126 35.237.175.180
    -203 213.55.99.121
    -332 45.5.184.72
    -377 5.9.6.51
    -512 45.5.184.2
    -4644 205.186.128.185
    -4644 70.32.83.92
    -
    - -
  • There seems to be a pattern with 70.32.83.92 and 205.186.128.185 lately!

  • - -
  • Every morning at 8AM they are the top users… I should tell them to stagger their requests…

  • - -
  • I signed up for a VisualPing of the PostgreSQL JDBC driver download page to my CGIAR email address

    - + 67 207.46.13.50 + 105 41.204.190.40 + 117 34.218.226.147 + 126 35.237.175.180 + 203 213.55.99.121 + 332 45.5.184.72 + 377 5.9.6.51 + 512 45.5.184.2 + 4644 205.186.128.185 + 4644 70.32.83.92 + - -

    2019-01-29

    - +
  • +
  • Last night Linode sent an alert that CGSpace (linode18) was using high CPU, here are the most active IPs in the hours just before, during, and after the alert:
  • + +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "28/Jan/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +    310 45.5.184.2
    +    425 5.143.231.39
    +    526 54.70.40.11
    +   1003 199.47.87.141
    +   1374 35.237.175.180
    +   1455 5.9.6.51
    +   1501 66.249.66.223
    +   1771 66.249.66.219
    +   2107 199.47.87.140
    +   2540 45.5.186.2
    +
    +
    TurnitinBot (https://turnitin.com/robot/crawlerinfo.html)
    +

    2019-01-29

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "29/Jan/2019:0(3|4|5|6|7)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -334 45.5.184.72
    -429 66.249.66.223
    -522 35.237.175.180
    -555 34.218.226.147
    -655 66.249.66.221
    -844 5.9.6.51
    -2507 66.249.66.219
    -4645 70.32.83.92
    -4646 205.186.128.185
    -9329 45.5.186.2
    -
    - -
  • 45.5.186.2 is CIAT as usual…

  • - -
  • 70.32.83.92 and 205.186.128.185 are CCAFS as usual…

  • - -
  • 66.249.66.219 is Google…

  • - -
  • I’m thinking it might finally be time to increase the threshold of the Linode CPU alerts

    - + 334 45.5.184.72 + 429 66.249.66.223 + 522 35.237.175.180 + 555 34.218.226.147 + 655 66.249.66.221 + 844 5.9.6.51 + 2507 66.249.66.219 + 4645 70.32.83.92 + 4646 205.186.128.185 + 9329 45.5.186.2 + - -

    2019-01-30

    - +
  • + +

    2019-01-30

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "30/Jan/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -273 46.101.86.248
    -301 35.237.175.180
    -334 45.5.184.72
    -387 5.9.6.51
    -527 2a01:4f8:13b:1296::2
    -1021 34.218.226.147
    -1448 66.249.66.219
    -4649 205.186.128.185
    -4649 70.32.83.92
    -5163 45.5.184.2
    -
    - -
  • I might need to adjust the threshold again, because the load average this morning was 296% and the activity looks pretty normal (as always recently)

  • + 273 46.101.86.248 + 301 35.237.175.180 + 334 45.5.184.72 + 387 5.9.6.51 + 527 2a01:4f8:13b:1296::2 + 1021 34.218.226.147 + 1448 66.249.66.219 + 4649 205.186.128.185 + 4649 70.32.83.92 + 5163 45.5.184.2 + - -

    2019-01-31

    - +

    2019-01-31

    - - +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "30/Jan/2019:(16|17|18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +    436 18.196.196.108
    +    460 157.55.39.168
    +    460 207.46.13.96
    +    500 197.156.105.116
    +    728 54.70.40.11
    +   1560 5.9.6.51
    +   1562 35.237.175.180
    +   1601 85.25.237.71
    +   1894 66.249.66.219
    +   2610 45.5.184.2
    +# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "31/Jan/2019:0(2|3|4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +    318 207.46.13.242
    +    334 45.5.184.72
    +    486 35.237.175.180
    +    609 34.218.226.147
    +    620 66.249.66.219
    +   1054 5.9.6.51
    +   4391 70.32.83.92
    +   4428 205.186.128.185
    +   6758 85.25.237.71
    +   9239 45.5.186.2
    +
    +
    Linguee Bot (http://www.linguee.com/bot; bot@linguee.com)
    +
    diff --git a/docs/2019-02/index.html b/docs/2019-02/index.html index f24b5d896..c94b0b74c 100644 --- a/docs/2019-02/index.html +++ b/docs/2019-02/index.html @@ -8,28 +8,23 @@ @@ -49,28 +43,23 @@ sys 0m1.979s - + @@ -162,211 +150,176 @@ sys 0m1.979s

    -

    2019-02-01

    - +

    2019-02-01

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "01/Feb/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -245 207.46.13.5
    -332 54.70.40.11
    -385 5.143.231.38
    -405 207.46.13.173
    -405 207.46.13.75
    -1117 66.249.66.219
    -1121 35.237.175.180
    -1546 5.9.6.51
    -2474 45.5.186.2
    -5490 85.25.237.71
    -
    - -
  • 85.25.237.71 is the “Linguee Bot” that I first saw last month

  • - -
  • The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase

  • - -
  • There were just over 3 million accesses in the nginx logs last month:

    - + 245 207.46.13.5 + 332 54.70.40.11 + 385 5.143.231.38 + 405 207.46.13.173 + 405 207.46.13.75 + 1117 66.249.66.219 + 1121 35.237.175.180 + 1546 5.9.6.51 + 2474 45.5.186.2 + 5490 85.25.237.71 +
    # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
     3018243
     
     real    0m19.873s
     user    0m22.203s
     sys     0m1.979s
    -
  • - - - +

    2019-02-02

    - -

    2019-02-03

    - +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Feb/2019:0(1|2|3|4|5)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +    284 18.195.78.144
    +    329 207.46.13.32
    +    417 35.237.175.180
    +    448 34.218.226.147
    +    694 2a01:4f8:13b:1296::2
    +    718 2a01:4f8:140:3192::2
    +    786 137.108.70.14
    +   1002 5.9.6.51
    +   6077 85.25.237.71
    +   8726 45.5.184.2
    +
    +

    2019-02-03

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -325 85.25.237.71
    -340 45.5.184.72
    -431 5.143.231.8
    -756 5.9.6.51
    -1048 34.218.226.147
    -1203 66.249.66.219
    -1496 195.201.104.240
    -4658 205.186.128.185
    -4658 70.32.83.92
    -4852 45.5.184.2
    -
    - -
  • 45.5.184.2 is CIAT, 70.32.83.92 and 205.186.128.185 are Macaroni Bros harvesters for CCAFS I think

  • - -
  • 195.201.104.240 is a new IP address in Germany with the following user agent:

    - + 325 85.25.237.71 + 340 45.5.184.72 + 431 5.143.231.8 + 756 5.9.6.51 + 1048 34.218.226.147 + 1203 66.249.66.219 + 1496 195.201.104.240 + 4658 205.186.128.185 + 4658 70.32.83.92 + 4852 45.5.184.2 +
    Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
    -
  • This user was making 20–60 requests per minute this morning… seems like I should try to block this type of behavior heuristically, regardless of user agent! (See the sketch after the per-minute breakdown below.)
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Feb/2019" | grep 195.201.104.240 | grep -o -E '03/Feb/2019:0[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 20
    - 19 03/Feb/2019:07:42
    - 20 03/Feb/2019:07:12
    - 21 03/Feb/2019:07:27
    - 21 03/Feb/2019:07:28
    - 25 03/Feb/2019:07:23
    - 25 03/Feb/2019:07:29
    - 26 03/Feb/2019:07:33
    - 28 03/Feb/2019:07:38
    - 30 03/Feb/2019:07:31
    - 33 03/Feb/2019:07:35
    - 33 03/Feb/2019:07:37
    - 38 03/Feb/2019:07:40
    - 43 03/Feb/2019:07:24
    - 43 03/Feb/2019:07:32
    - 46 03/Feb/2019:07:36
    - 47 03/Feb/2019:07:34
    - 47 03/Feb/2019:07:39
    - 47 03/Feb/2019:07:41
    - 51 03/Feb/2019:07:26
    - 59 03/Feb/2019:07:25
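  • A sketch of the heuristic I have in mind, assuming Python 3 and combined-format nginx logs on stdin (the threshold is an arbitrary example value):

    #!/usr/bin/env python3
    # Count requests per IP per minute and print any (IP, minute) pair that
    # exceeds a threshold, regardless of the user agent.
    import re
    import sys
    from collections import Counter

    THRESHOLD = 30  # requests per minute
    pattern = re.compile(r'^(\S+) .*?\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2})')

    counts = Counter()
    for line in sys.stdin:
        match = pattern.match(line)
        if match:
            counts[match.groups()] += 1  # key is (ip, dd/Mon/yyyy:hh:mm)

    for (ip, minute), count in counts.most_common():
        if count > THRESHOLD:
            print(f'{count:5d} {minute} {ip}')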
  • At least they re-used their Tomcat session!

    - + 19 03/Feb/2019:07:42 + 20 03/Feb/2019:07:12 + 21 03/Feb/2019:07:27 + 21 03/Feb/2019:07:28 + 25 03/Feb/2019:07:23 + 25 03/Feb/2019:07:29 + 26 03/Feb/2019:07:33 + 28 03/Feb/2019:07:38 + 30 03/Feb/2019:07:31 + 33 03/Feb/2019:07:35 + 33 03/Feb/2019:07:37 + 38 03/Feb/2019:07:40 + 43 03/Feb/2019:07:24 + 43 03/Feb/2019:07:32 + 46 03/Feb/2019:07:36 + 47 03/Feb/2019:07:34 + 47 03/Feb/2019:07:39 + 47 03/Feb/2019:07:41 + 51 03/Feb/2019:07:26 + 59 03/Feb/2019:07:25 +
    $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=195.201.104.240' dspace.log.2019-02-03 | sort | uniq | wc -l
     1
  • This user was making requests to /browse, which is not currently under the existing rate limiting of dynamic pages in our nginx config
  • Run all system updates on linode20 and reboot it

    2019-02-04

    - + + +

    2019-02-04

    dspace=# \copy (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=124 GROUP BY text_value ORDER BY COUNT DESC) to /tmp/cta-subjects.csv with csv header;
     COPY 321
    -
    - -
  • Skype with Michael Victor about CKM and CGSpace

  • - -
  • Discuss the new IITA research theme field with Abenet and decide that we should use cg.identifier.iitatheme

  • - -
  • This morning there was another alert from Linode about the high load on CGSpace (linode18), here are the top IPs in the web server logs before, during, and after that time:

    - -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "04/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -589 2a01:4f8:140:3192::2
    -762 66.249.66.219
    -889 35.237.175.180
    -1332 34.218.226.147
    -1393 5.9.6.51
    -1940 50.116.102.77
    -3578 85.25.237.71
    -4311 45.5.184.2
    -4658 205.186.128.185
    -4658 70.32.83.92
    -
  • - -
  • At this rate I think I just need to stop paying attention to these alerts—DSpace gets thrashed when people use the APIs properly and there’s nothing we can do to improve REST API performance!

  • - -
  • Perhaps I just need to keep increasing the Linode alert threshold (currently 300%) for this host?

  • + - -

    2019-02-05

    - +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "04/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +    589 2a01:4f8:140:3192::2
    +    762 66.249.66.219
    +    889 35.237.175.180
    +   1332 34.218.226.147
    +   1393 5.9.6.51
    +   1940 50.116.102.77
    +   3578 85.25.237.71
    +   4311 45.5.184.2
    +   4658 205.186.128.185
    +   4658 70.32.83.92
    +
    +

    2019-02-05

    or(
    -isNotNull(value.match(/.*\uFFFD.*/)),
    -isNotNull(value.match(/.*\u00A0.*/)),
    -isNotNull(value.match(/.*\u200A.*/)),
    -isNotNull(value.match(/.*\u2019.*/)),
    -isNotNull(value.match(/.*\u00b4.*/)),
    -isNotNull(value.match(/.*\u007e.*/))
    +  isNotNull(value.match(/.*\uFFFD.*/)),
    +  isNotNull(value.match(/.*\u00A0.*/)),
    +  isNotNull(value.match(/.*\u200A.*/)),
    +  isNotNull(value.match(/.*\u2019.*/)),
    +  isNotNull(value.match(/.*\u00b4.*/)),
    +  isNotNull(value.match(/.*\u007e.*/))
     ).toString()
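  • Roughly the same check in Python, just as a sketch outside of OpenRefine (the sample value is made up):

    # Flag metadata values containing the same problematic characters as the
    # GREL expression above.
    SUSPECT = '\ufffd\u00a0\u200a\u2019\u00b4\u007e'

    def has_suspect_chars(value: str) -> bool:
        return any(ch in value for ch in SUSPECT)

    print(has_suspect_chars('CLIMATE\u00a0CHANGE'))  # True: non-breaking space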
  • Testing the corrections for sixty-five items and sixteen deletions using my fix-metadata-values.py and delete-metadata-values.py scripts:

    - +
    $ ./fix-metadata-values.py -i 2019-02-04-Correct-65-CTA-Subjects.csv -f cg.subject.cta -t CORRECT -m 124 -db dspace -u dspace -p 'fuu' -d
     $ ./delete-metadata-values.py -i 2019-02-04-Delete-16-CTA-Subjects.csv -f cg.subject.cta -m 124 -db dspace -u dspace -p 'fuu' -d
    -
  • - -
  • I applied them on DSpace Test and CGSpace and started a full Discovery re-index:

    - +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
     $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
  • Peter had marked several terms with || to indicate multiple values in his corrections so I will have to go back and do those manually (a small sketch for finding them follows the list below):
    EMPODERAMENTO DE JOVENS,EMPODERAMENTO||JOVENS
     ENVIRONMENTAL PROTECTION AND NATURAL RESOURCES MANAGEMENT,NATURAL RESOURCES MANAGEMENT||ENVIRONMENT
     FISHERIES AND AQUACULTURE,FISHERIES||AQUACULTURE
    @@ -375,468 +328,387 @@ MARKETING ET COMMERCE,MARKETING||COMMERCE
     NATURAL RESOURCES AND ENVIRONMENT,NATURAL RESOURCES MANAGEMENT||ENVIRONMENT
     PÊCHES ET AQUACULTURE,PÊCHES||AQUACULTURE
     PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
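  • A small sketch for finding those rows, assuming Python 3 and that the corrections CSV has the same cg.subject.cta and CORRECT columns used with fix-metadata-values.py above:

    #!/usr/bin/env python3
    # Print the corrections that contain "||" so they can be applied by hand
    # instead of with fix-metadata-values.py.
    import csv

    with open('2019-02-04-Correct-65-CTA-Subjects.csv', newline='') as f:
        for row in csv.DictReader(f):
            if '||' in row['CORRECT']:
                print(f"{row['cg.subject.cta']} -> {row['CORRECT']}")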

    2019-02-06

    - +

    2019-02-06

    - -

    2019-02-07

    - +
    $ dspace metadata-export -i 10568/42211 -f /tmp/cta.csv

    $ csvcut -c "id,collection,cg.subject.cta,cg.subject.cta[],cg.subject.cta[en_US]" /tmp/cta.csv > /tmp/cta-subjects.csv

    $ dspace metadata-import -f /tmp/2019-02-06-CTA-multiple-subjects.csv

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "06/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
        689 35.237.175.180
       1236 5.9.6.51
       1305 34.218.226.147
       1580 66.249.66.219
       1939 50.116.102.77
       2313 108.212.105.35
       4666 205.186.128.185
       4666 70.32.83.92
       4950 85.25.237.71
       5158 45.5.186.2

    # zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep 45.5.186.2 | grep -o -E '06/Feb/2019:0[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 10
        118 06/Feb/2019:05:46
        119 06/Feb/2019:05:37
        119 06/Feb/2019:05:47
        120 06/Feb/2019:05:43
        120 06/Feb/2019:05:44
        121 06/Feb/2019:05:38
        122 06/Feb/2019:05:39
        125 06/Feb/2019:05:42
        126 06/Feb/2019:05:40
        126 06/Feb/2019:05:41

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E '06/Feb/2019' | grep 45.5.186.2 | awk '{print $9}' | sort | uniq -c
      10411 200
          1 301
          7 302
          3 404
         18 499
          2 500

    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "06/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
        328 220.247.212.35
        372 66.249.66.221
        380 207.46.13.2
        519 2a01:4f8:140:3192::2
        572 5.143.231.8
        689 35.237.175.180
        771 108.212.105.35
       1236 5.9.6.51
       1554 66.249.66.219
       4942 85.25.237.71
    # zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "06/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         10 66.249.66.221
         26 66.249.66.219
         69 5.143.231.8
        340 45.5.184.72
       1040 34.218.226.147
       1542 108.212.105.35
       1937 50.116.102.77
       4661 205.186.128.185
       4661 70.32.83.92
       5102 45.5.186.2

    2019-02-07

    # zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "06/Feb/2019:(17|18|19|20|23)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
          5 66.249.66.209
          6 2a01:4f8:210:51ef::2
          6 40.77.167.75
          9 104.198.9.108
          9 157.55.39.192
         10 157.55.39.244
         12 66.249.66.221
         20 95.108.181.88
         27 66.249.66.219
       2381 45.5.186.2
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "06/Feb/2019:(17|18|19|20|23)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
        455 45.5.186.2
        506 40.77.167.75
        559 54.70.40.11
        825 157.55.39.244
        871 2a01:4f8:140:3192::2
        938 157.55.39.192
       1058 85.25.237.71
       1416 5.9.6.51
       1606 66.249.66.219
       1718 35.237.175.180

    # zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "07/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
          5 66.249.66.223
          8 104.198.9.108
         13 110.54.160.222
         24 66.249.66.219
         25 175.158.217.98
        214 34.218.226.147
        346 45.5.184.72
       4529 45.5.186.2
       4661 205.186.128.185
       4661 70.32.83.92
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "07/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
        145 157.55.39.237
        154 66.249.66.221
        214 34.218.226.147
        261 35.237.175.180
        273 2a01:4f8:140:3192::2
        300 169.48.66.92
        487 5.143.231.39
        766 5.9.6.51
        771 85.25.237.71
        848 66.249.66.219

    Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:1056 by user 1759

    IITA Posters and Presentations workflow step 1 empty

    $ dspace test-email
     
     About to send test email:
     - To: aorth@mjanja.ch
     - Subject: DSpace test email
     - Server: smtp.serv.cgnet.com
     
     Error sending email:
     - Error: javax.mail.MessagingException: Could not connect to SMTP host: smtp.serv.cgnet.com, port: 25;
      nested exception is:
            java.net.ConnectException: Connection refused (Connection refused)
     
     Please see the DSpace documentation for assistance.
  • I can’t connect to TCP port 25 on that server so I sent a mail to CGNET support to ask what’s up

  • CGNET said these servers were discontinued in 2018-01 and that I should use Office 365


    2019-02-08


    Error sending email:
     - Error: com.sun.mail.smtp.SMTPSendFailedException: 530 5.7.57 SMTP; Client was not authenticated to send anonymous mail during MAIL FROM [AM6PR10CA0028.EURPRD10.PROD.OUTLOOK.COM]

    2019-02-09

    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "09/Feb/2019:(07|08|09|10|11)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
        289 35.237.175.180
        290 66.249.66.221
        296 18.195.78.144
        312 207.46.13.201
        393 207.46.13.64
        526 2a01:4f8:140:3192::2
        580 151.80.203.180
        742 5.143.231.38
       1046 5.9.6.51
       1331 66.249.66.219
    # zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "09/Feb/2019:(07|08|09|10|11)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
          4 66.249.83.30
          5 49.149.10.16
          8 207.46.13.64
          9 207.46.13.201
         11 105.63.86.154
         11 66.249.66.221
         31 66.249.66.219
        297 2001:41d0:d:1990::
        908 34.218.226.147
       1947 50.116.102.77

    /bitstream/handle/10568/68981/Identifying%20benefit%20flows%20studies%20on%20the%20potential%20monetary%20and%20non%20monetary%20benefits%20arising%20from%20the%20International%20Treaty%20on%20Plant%20Genetic_1671.pdf?sequence=1&isAllowed=../etc/passwd

    2019-02-10

    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "10/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
        232 18.195.78.144
        238 35.237.175.180
        281 66.249.66.221
        314 151.80.203.180
        319 34.218.226.147
        326 40.77.167.178
        352 157.55.39.149
        444 2a01:4f8:140:3192::2
       1171 5.9.6.51
       1196 66.249.66.219
     # zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "10/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
          6 112.203.241.69
          7 157.55.39.149
          9 40.77.167.178
         15 66.249.66.219
        368 45.5.184.72
        432 50.116.102.77
        971 34.218.226.147
       4403 45.5.186.2
       4668 205.186.128.185
       4668 70.32.83.92

  • Another interesting thing might be the total number of requests for web and API services during that time:

    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -cE "10/Feb/2019:0(5|6|7|8|9)"
     16333
     # zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -cE "10/Feb/2019:0(5|6|7|8|9)"
     15964
  • Also, the number of unique IPs served during that time:

    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "10/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq | wc -l
     1622
     # zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "10/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq | wc -l
     95
  • It’s very clear to me now that the API requests are the heaviest!

  • I think I need to increase the Linode alert threshold from 300 to 350% now so I stop getting some of these alerts—it’s becoming a bit of the boy who cried wolf because it alerts like clockwork twice per day!

  • Add my Python- and shell-based metadata workflow helper scripts as well as the environment settings for pipenv to our DSpace repository (#408) so I can track changes and distribute them more formally instead of just keeping them collected on the wiki

  • Started adding IITA research theme (cg.identifier.iitatheme) to CGSpace

  • Update ILRI author name style in the controlled vocabulary (Domelevo Entfellner, Jean-Baka) (#409)
  • Last week Hector Tobon from CCAFS asked me about the Creative Commons 3.0 Intergovernmental Organizations (IGO) license because it is not in the list of SPDX licenses

  • Testing the mail.server.disabled property that I noticed in dspace.cfg recently

    Error sending email:
     - Error: cannot test email because mail.server.disabled is set to true

  • I’m not sure why I didn’t know about this configuration option before, and always maintained multiple configurations for development and production

  • I updated my local Sonatype nexus Docker image and had an issue with the volume for some reason so I decided to just start from scratch:
    # docker rm nexus
     # docker pull sonatype/nexus3
     # mkdir -p /home/aorth/.local/lib/containers/volumes/nexus_data
     # chown 200:200 /home/aorth/.local/lib/containers/volumes/nexus_data
     # docker run --name nexus --network dspace-build -d -v /home/aorth/.local/lib/containers/volumes/nexus_data:/nexus-data -p 8081:8081 sonatype/nexus3
  • For some reason my mvn package for DSpace is not working now… I might go back to using Artifactory for caching instead:

    # docker pull docker.bintray.io/jfrog/artifactory-oss:latest
     # mkdir -p /home/aorth/.local/lib/containers/volumes/artifactory5_data
     # chown 1030 /home/aorth/.local/lib/containers/volumes/artifactory5_data
     # docker run --name artifactory --network dspace-build -d -v /home/aorth/.local/lib/containers/volumes/artifactory5_data:/var/opt/jfrog/artifactory -p 8081:8081 docker.bintray.io/jfrog/artifactory-oss

    2019-02-11


    2019-02-12

    $ vipsthumbnail alc_contrastes_desafios.pdf -s 300 -o '%s.jpg[Q=92,optimize_coding,strip]'
  • (DSpace 5 appears to use JPEG 92 quality so I do the same)

  • Thinking about making “top items” endpoints in my dspace-statistics-api

  • I could use the following SQL queries very easily to get the top items by views or downloads:

    dspacestatistics=# SELECT * FROM items WHERE views > 0 ORDER BY views DESC LIMIT 10;
     dspacestatistics=# SELECT * FROM items WHERE downloads > 0 ORDER BY downloads DESC LIMIT 10;
  • I’d have to think about what to make the REST API endpoints, perhaps: /statistics/top/items?limit=10

  • But how do I do top items by views / downloads separately?

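  • One idea, sketched below (this is not the actual dspace-statistics-api code, and the by parameter and endpoint shape are hypothetical): a single query parameterized on a whitelisted metric could back something like /statistics/top/items?by=views&limit=10; the database name comes from the psql prompt above, the user name is an assumption:

    # Sketch only: serve "top items" by either metric from one code path.
    import psycopg2

    def top_items(metric='views', limit=10):
        # whitelist the column name because identifiers can't be bound parameters
        if metric not in ('views', 'downloads'):
            raise ValueError('metric must be "views" or "downloads"')
        conn = psycopg2.connect(dbname='dspacestatistics', user='dspacestatistics')
        with conn, conn.cursor() as cursor:
            cursor.execute(
                f'SELECT id, {metric} FROM items WHERE {metric} > 0 '
                f'ORDER BY {metric} DESC LIMIT %s',
                (limit,),
            )
            return cursor.fetchall()

    print(top_items('downloads'))
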
  • I re-deployed DSpace 6.3 locally to test the PDFBox thumbnails, especially to see if they handle CMYK files properly

    $ identify -verbose alc_contrastes_desafios.pdf.jpg
     ...
      Colorspace: sRGB

  • I will read the PDFBox thumbnailer documentation to see if I can change the size and quality


    2019-02-13


    mail.extraproperties = mail.smtp.starttls.required = true, mail.smtp.auth=true
  • But the result is still:

    Error sending email:
     - Error: com.sun.mail.smtp.SMTPSendFailedException: 530 5.7.57 SMTP; Client was not authenticated to send anonymous mail during MAIL FROM [AM6PR06CA0001.eurprd06.prod.outlook.com]

  • I tried to log into the Outlook 365 web mail and it doesn’t work so I’ve emailed ILRI ICT again

  • After reading the common mistakes in the JavaMail FAQ I reconfigured the extra properties in DSpace’s mail configuration to be simply:

    mail.extraproperties = mail.smtp.starttls.enable=true
  • … and then I was able to send a mail using my personal account where I know the credentials work

  • The CGSpace account still gets this error message:

    Error sending email:
     - Error: javax.mail.AuthenticationFailedException

  • I updated the DSpace SMTP settings in dspace.cfg as well as the variables in the DSpace role of the Ansible infrastructure scripts

  • Thierry from CTA is having issues with his account on DSpace Test, and there is no admin password reset function on DSpace (only via email, which is disabled on DSpace Test), so I have to delete and re-create his account:

    $ dspace user --delete --email blah@cta.int
     $ dspace user --add --givenname Thierry --surname Lewyllie --email blah@cta.int --password 'blah'
  • On this note, I saw a thread on the dspace-tech mailing list that says this functionality exists if you enable webui.user.assumelogin = true

  • I will enable this on CGSpace (#411)

  • Test re-creating my local PostgreSQL and Artifactory containers with podman instead of Docker (using the volumes from my old Docker containers though):

    # podman pull postgres:9.6-alpine
     # podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
     # podman pull docker.bintray.io/jfrog/artifactory-oss
     # podman run --name artifactory -d -v /home/aorth/.local/lib/containers/volumes/artifactory5_data:/var/opt/jfrog/artifactory -p 8081:8081 docker.bintray.io/jfrog/artifactory-oss
  • Totally works… awesome!

  • Then I tried with rootless containers by creating the subuid and subgid mappings for aorth:

    $ sudo touch /etc/subuid /etc/subgid
     $ usermod --add-subuids 10000-75535 aorth
     $ usermod --add-subgids 10000-75535 aorth
     $ sudo sysctl kernel.unprivileged_userns_clone=1
     $ podman pull postgres:9.6-alpine
     $ podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
  • Which totally works, but Podman’s rootless support doesn’t work with port mappings yet…

  • Deploy the Tomcat-7-from-tarball branch on CGSpace (linode18), but first stop the Ubuntu Tomcat 7 and do some basic prep before running the Ansible playbook:

    # systemctl stop tomcat7
     # apt remove tomcat7 tomcat7-admin
     # useradd -m -r -s /bin/bash dspace
     # chown -R dspace:dspace /home/dspace
     # chown -R dspace:dspace /home/cgspace.cgiar.org
     # dpkg -P tomcat7-admin tomcat7-common
  • After running the playbook CGSpace came back up, but I had an issue with some Solr cores not being loaded (similar to last month) and this was in the Solr log:

    2019-02-14 18:17:31,304 ERROR org.apache.solr.core.SolrCore @ org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2018': Unable to create core [statistics-2018] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
  • The issue last month was address space, which is now set as LimitAS=infinity in tomcat7.service

  • I re-ran the Ansible playbook to make sure all configs etc. were the same, then rebooted the server

  • Still the error persists after reboot

  • I will try to stop Tomcat and then remove the locks manually:

    # find /home/cgspace.cgiar.org/solr/ -iname "write.lock" -delete
  • After restarting Tomcat the usage statistics are back

  • Interestingly, many of the locks were from last month, last year, and even 2015! I’m pretty sure that’s not supposed to be how locks work…

  • Help Sarah Kasyoka finish an item submission that she was having issues with due to the file size

  • I increased the nginx upload limit, but she said she was having problems and couldn’t really tell me why

  • I logged in as her and completed the submission with no problems…



    2019-02-15

    [Fri Feb 15 03:10:42 2019] Out of memory: Kill process 12027 (java) score 670 or sacrifice child
     [Fri Feb 15 03:10:42 2019] Killed process 12027 (java) total-vm:14108048kB, anon-rss:5450284kB, file-rss:0kB, shmem-rss:0kB
     [Fri Feb 15 03:10:43 2019] oom_reaper: reaped process 12027 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
  • The tomcat7 service shows:

    Feb 15 03:10:44 linode19 systemd[1]: tomcat7.service: Main process exited, code=killed, status=9/KILL
  • I suspect it was related to the media-filter cron job that runs at 3AM but I don’t see anything particular in the log files

  • I want to try to normalize the text_lang values to make working with metadata easier

  • We currently have a bunch of weird values that DSpace uses like NULL, en_US, and en and others that have been entered manually by editors:

    dspace=# SELECT DISTINCT text_lang, count(*) FROM metadatavalue WHERE resource_type_id=2 GROUP BY text_lang ORDER BY count DESC;
     text_lang |  count
    -----------+---------
               | 1069539
     en_US     |  577110
               |  334768
     en        |  133501
     es        |      12
     *         |      11
     es_ES     |       2
     fr        |       2
     spa       |       2
     E.        |       1
     ethnob    |       1

  • The majority are NULL, en_US, the blank string, and en—the rest are not enough to be significant

  • Theoretically this field could help if you wanted to search for Spanish-language fields in the API or something, but even for the English fields there are two different values (and those are from DSpace itself)!

  • I'm going to normalize these to NULL at least on DSpace Test for now:

    dspace=# UPDATE metadatavalue SET text_lang = NULL WHERE resource_type_id=2 AND text_lang IS NOT NULL;
     UPDATE 1045410
  • I started proofing IITA’s 2019-01 records that Sisay uploaded this week

  • ILRI ICT fixed the password for the CGSpace support email account and I tested it on Outlook 365 web and DSpace and it works
  • Re-create my local PostgreSQL container for the new PostgreSQL version and to use podman's volumes:
    $ podman pull postgres:9.6-alpine
     $ podman volume create dspacedb_data
     $ podman run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
     $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
     $ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost dspace_2019-02-11.backup
     $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
     $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
  • And it’s all running without root!

  • Then re-create my Artifactory container as well, taking into account ulimit open file requirements by Artifactory as well as the user limitations caused by rootless subuid mappings:

    $ podman volume create artifactory_data
     artifactory_data
     $ podman create --ulimit nofile=32000:32000 --name artifactory -v artifactory_data:/var/opt/jfrog/artifactory -p 8081:8081 docker.bintray.io/jfrog/artifactory-oss
     $ buildah unshare
     $ chown -R 1030:1030 ~/.local/share/containers/storage/volumes/artifactory_data
     $ exit
     $ podman start artifactory
  • More on the subuid permissions issue with rootless containers here


    2019-02-17


    $ dspace cleanup -v
     Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
      Detail: Key (bitstream_id)=(162844) is still referenced from table "bundle".

  • The solution is, as always:

    $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (162844);'
     UPDATE 1
  • I merged the Atmire Metadata Quality Module (MQM) changes to the 5_x-prod branch and deployed it on CGSpace (#407)

  • Then I ran all system updates on CGSpace server and rebooted it


    2019-02-18


    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "18/Feb/2019:1(2|3|4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
       1236 18.212.208.240
       1276 54.164.83.99
       1277 3.83.14.11
       1282 3.80.196.188
       1296 3.84.172.18
       1299 100.24.48.177
       1299 34.230.15.139
       1327 52.54.252.47
       1477 5.9.6.51
       1861 94.71.244.172
     # zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "18/Feb/2019:1(2|3|4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
          8 42.112.238.64
          9 121.52.152.3
          9 157.55.39.50
         10 110.54.151.102
         10 194.246.119.6
         10 66.249.66.221
         15 190.56.193.94
         28 66.249.66.219
         43 34.209.213.122
        178 50.116.102.77
     # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "18/Feb/2019:1(2|3|4|5|6)" | awk '{print $1}' | sort | uniq | wc -l
     2727
     # zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "18/Feb/2019:1(2|3|4|5|6)" | awk '{print $1}' | sort | uniq | wc -l
     186
  • 94.71.244.172 is in Greece and uses the user agent “Indy Library”

  • At least they are re-using their Tomcat session:

    $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=94.71.244.172' dspace.log.2019-02-18 | sort | uniq | wc -l
  • The following IPs were all hitting the server hard simultaneously and are located on Amazon and use the user agent “Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0”:

  • Actually, even up to the top 30 IPs are almost all on Amazon and use the same user agent!

  • For reference most of these IPs hitting the XMLUI this afternoon are on Amazon:

    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "18/Feb/2019:1(2|3|4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 30
       1173 52.91.249.23
       1176 107.22.118.106
       1178 3.88.173.152
       1179 3.81.136.184
       1183 34.201.220.164
       1183 3.89.134.93
       1184 54.162.66.53
       1187 3.84.62.209
       1188 3.87.4.140
       1189 54.158.27.198
       1190 54.209.39.13
       1192 54.82.238.223
       1208 3.82.232.144
       1209 3.80.128.247
       1214 54.167.64.164
       1219 3.91.17.126
       1220 34.201.108.226
       1221 3.84.223.134
       1222 18.206.155.14
       1231 54.210.125.13
       1236 18.212.208.240
       1276 54.164.83.99
       1277 3.83.14.11
       1282 3.80.196.188
       1296 3.84.172.18
       1299 100.24.48.177
       1299 34.230.15.139
       1327 52.54.252.47
       1477 5.9.6.51
       1861 94.71.244.172
    - -
  • In the case of 52.54.252.47 they are only making about 10 requests per minute during this time (albeit from dozens of concurrent IPs):

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep 52.54.252.47 | grep -o -E '18/Feb/2019:1[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 10
         10 18/Feb/2019:17:20
         10 18/Feb/2019:17:22
         10 18/Feb/2019:17:31
         11 18/Feb/2019:13:21
         11 18/Feb/2019:15:18
         11 18/Feb/2019:16:43
         11 18/Feb/2019:16:57
         11 18/Feb/2019:16:58
         11 18/Feb/2019:18:34
         12 18/Feb/2019:14:37

  • As this user agent is not recognized as a bot by DSpace this will definitely fuck up the usage statistics

  • There were 92,000 requests from these IPs alone today!

    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -c 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'
     92756
  • I will add this user agent to the “badbots” rate limiting in our nginx configuration

  • I realized that I had effectively only been applying the “badbots” rate limiting to requests at the root, so I added it to the other blocks that match Discovery, Browse, etc as well

  • IWMI sent a few new ORCID identifiers for us to add to our controlled vocabulary

  • I will merge them with our existing list and then resolve their names using my resolve-orcids.py script:

    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml 2019-02-18-IWMI-ORCID-IDs.txt  | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2019-02-18-combined-orcids.txt
     $ ./resolve-orcids.py -i /tmp/2019-02-18-combined-orcids.txt -o /tmp/2019-02-18-combined-names.txt -d
     # sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
     $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
  • I merged the changes to the 5_x-prod branch and they will go live the next time we re-deploy CGSpace (#412)


    2019-02-19


    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "19/Feb/2019:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      11541 18.212.208.240
      11560 3.81.136.184
      11562 3.88.237.84
      11569 34.230.15.139
      11572 3.80.128.247
      11573 3.91.17.126
      11586 54.82.89.217
      11610 54.209.39.13
      11657 54.175.90.13
      14686 143.233.242.130

  • 143.233.242.130 is in Greece and using the user agent “Indy Library”, like the top IP yesterday (94.71.244.172)

  • That user agent is in our Tomcat list of crawlers so at least its resource usage is controlled by forcing it to use a single Tomcat session, but I don’t know if DSpace recognizes if this is a bot or not, so the logs are probably skewed because of this

  • The user is requesting only things like /handle/10568/56199?show=full so it’s nothing malicious, only annoying

  • Otherwise there are still shit loads of IPs from Amazon hammering the server, though I see HTTP 503 errors now after yesterday's nginx rate limiting updates


  • The top requests in the API logs today are:
    # zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "19/Feb/2019:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         42 66.249.66.221
         44 156.156.81.215
         55 3.85.54.129
         76 66.249.66.219
         87 34.209.213.122
       1550 34.218.226.147
       2127 50.116.102.77
       4684 205.186.128.185
      11429 45.5.186.2
      12360 2a01:7e00::f03c:91ff:fe0a:d645

    Usage stats

    # grep 41.190.30.105 /var/log/nginx/access.log | grep -c 'acgg_progress_report.pdf'
     185
  • Wow, and another IP in Nigeria made a bunch more yesterday from the same user agent:

    # grep 41.190.3.229 /var/log/nginx/access.log.1 | grep -c 'acgg_progress_report.pdf'
     346
  • In the last two days alone there were 1,000 requests for this PDF, mostly from Nigeria!

    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep acgg_progress_report.pdf | grep -v 'upstream response is buffered' | awk '{print $1}' | sort | uniq -c | sort -n
          1 139.162.146.60
          1 157.55.39.159
          1 196.188.127.94
          1 196.190.127.16
          1 197.183.33.222
          1 66.249.66.221
          2 104.237.146.139
          2 175.158.209.61
          2 196.190.63.120
          2 196.191.127.118
          2 213.55.99.121
          2 82.145.223.103
          3 197.250.96.248
          4 196.191.127.125
          4 197.156.77.24
          5 105.112.75.237
        185 41.190.30.105
        346 41.190.3.229
        503 41.190.31.73

  • That is so weird, they are all using this Android user agent:

    Mozilla/5.0 (Linux; Android 7.0; TECNO Camon CX Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/33.0.0.0 Mobile Safari/537.36
  • I wrote a quick and dirty Python script called resolve-addresses.py to resolve IP addresses to their owning organization’s name, ASN, and country using the IPAPI.co API

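  • The idea is roughly the following (a sketch, not the actual resolve-addresses.py; the JSON field names are assumptions based on IPAPI.co's documentation):

    # Sketch: resolve an IP address to its organization, ASN, and country
    # using the IPAPI.co API (assumed fields: org, asn, country_name).
    import requests

    def resolve_address(ip):
        response = requests.get(f'https://ipapi.co/{ip}/json/')
        response.raise_for_status()
        data = response.json()
        return data.get('org'), data.get('asn'), data.get('country_name')

    print(resolve_address('41.190.31.73'))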

    2019-02-20

    $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "https://cgspace.cgiar.org/rest/items/find-by-metadata-field" -d '{"key": "cg.creator.id","value": "Alan S. Orth: 0000-0002-1735-7458", "language": ""}'
     $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "https://cgspace.cgiar.org/rest/items/find-by-metadata-field" -d '{"key": "cg.creator.id","value": "Alan S. Orth: 0000-0002-1735-7458", "language": null}'
     $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "https://cgspace.cgiar.org/rest/items/find-by-metadata-field" -d '{"key": "cg.creator.id","value": "Alan S. Orth: 0000-0002-1735-7458", "language": "en_US"}'
  • This returns six items for me, which is the same I see in a Discovery search

  • Hector Tobon from CIAT asked if it was possible to get item statistics from CGSpace so I told him to use my dspace-statistics-api

  • I was playing with YasGUI to query AGROVOC’s SPARQL endpoint, but they must have a cached version or something because I get an HTTP 404 if I try to go to the endpoint manually

  • I think I want to stick to the regular web services to validate AGROVOC terms

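  • For example, a term lookup against AGROVOC's REST search endpoint (the same endpoint I query from OpenRefine further down) could be as simple as this sketch:

    # Sketch: check whether a term returns any results from AGROVOC's search API.
    import requests

    def in_agrovoc(term, lang='en'):
        url = 'http://agrovoc.uniroma2.it/agrovoc/rest/v1/search'
        response = requests.get(url, params={'query': term, 'lang': lang})
        response.raise_for_status()
        return len(response.json()['results']) > 0

    print(in_agrovoc('MAIZE'))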

    YasGUI querying AGROVOC


    2019-02-21

    $ ./agrovoc-lookup.py -l en -i /tmp/top-1500-subjects.txt -om /tmp/matched-subjects-en.txt -or /tmp/rejected-subjects-en.txt
     $ ./agrovoc-lookup.py -l es -i /tmp/top-1500-subjects.txt -om /tmp/matched-subjects-es.txt -or /tmp/rejected-subjects-es.txt
     $ ./agrovoc-lookup.py -l fr -i /tmp/top-1500-subjects.txt -om /tmp/matched-subjects-fr.txt -or /tmp/rejected-subjects-fr.txt
  • Then I generated a list of all the unique matched terms:

    $ cat /tmp/matched-subjects-* | sort | uniq > /tmp/2019-02-21-matched-subjects.txt
  • And then a list of all the unique unmatched terms, using a utility I had never heard of before called comm (or alternatively with diff):

    $ sort /tmp/top-1500-subjects.txt > /tmp/subjects-sorted.txt
     $ comm -13 /tmp/2019-02-21-matched-subjects.txt /tmp/subjects-sorted.txt > /tmp/2019-02-21-unmatched-subjects.txt
     $ diff --new-line-format="" --unchanged-line-format="" /tmp/subjects-sorted.txt /tmp/2019-02-21-matched-subjects.txt > /tmp/2019-02-21-unmatched-subjects.txt
  • Generate a list of countries and regions from CGSpace for Sisay to look through:

    dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 228 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC) to /tmp/2019-02-21-countries.csv WITH CSV HEADER;
     COPY 202
     dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 227 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC) to /tmp/2019-02-21-regions.csv WITH CSV HEADER;
     COPY 33
  • I did a bit more work on the IITA research theme (adding it to Discovery search filters) and it’s almost ready so I created a pull request (#413)

  • I still need to test the batch tagging of IITA items with themes based on their IITA subjects:


    2019-02-22


  • Start looking at IITA's latest round of batch uploads called “IITA_Feb_14” on DSpace Test

  • Lots of incorrect values in subjects, but that's a difficult problem to fix in an automated way

  • I figured out how to query AGROVOC from OpenRefine using Jython by creating a custom text facet:

    import json
     import re
     import urllib
    import urllib2
     
     pattern = re.compile('^S[A-Z ]+$')
     if pattern.match(value):
      url = 'http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=' + urllib.quote_plus(value) + '&lang=en'
      get = urllib2.urlopen(url)
      data = json.load(get)
      if len(data['results']) == 1:
        return "matched"
     
     return "unmatched"
  • You have to make sure to URL encode the value with quote_plus() and it totally works, but it seems to refresh the facets (and therefore re-query everything) when you select a facet so that makes it basically unusable

  • There is a good resource discussing OpenRefine, Jython, and web scraping


    2019-02-24


    "results": [
        {
            "altLabel": "corn (maize)",
            "lang": "en",
            "prefLabel": "maize",
            "type": [
                "skos:Concept"
            ],
            "uri": "http://aims.fao.org/aos/agrovoc/c_12332",
            "vocab": "agrovoc"
        },


    2019-02-25

    $ grep -c ERROR /home/cgspace.cgiar.org/log/solr.log.2019-02-*
     /home/cgspace.cgiar.org/log/solr.log.2019-02-11.xz:0
     /home/cgspace.cgiar.org/log/solr.log.2019-02-12.xz:0
     /home/cgspace.cgiar.org/log/solr.log.2019-02-22.xz:0
     /home/cgspace.cgiar.org/log/solr.log.2019-02-23.xz:0
     /home/cgspace.cgiar.org/log/solr.log.2019-02-24:34
  • But I don’t see anything interesting in yesterday’s Solr log…

  • I see this in the Tomcat 7 logs yesterday:

    Feb 25 21:09:29 linode18 tomcat7[1015]: Error while updating
     Feb 25 21:09:29 linode18 tomcat7[1015]: java.lang.UnsupportedOperationException: Multiple update components target the same field:solr_update_time_stamp
     Feb 25 21:09:29 linode18 tomcat7[1015]:         at org.dspace.statistics.SolrLogger$9.visit(SourceFile:1241)
     Feb 25 21:09:29 linode18 tomcat7[1015]:         at org.dspace.statistics.SolrLogger.update(SourceFile:1220)
     Feb 25 21:09:29 linode18 tomcat7[1015]:         at org.dspace.statistics.StatisticsLoggingConsumer.consume(SourceFile:103)
     ...
  • In the Solr admin GUI I see we have the following error: “statistics-2011: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher”

  • I restarted Tomcat and upon startup I see lots of errors in the systemd journal, like:

    Feb 25 21:37:49 linode18 tomcat7[28363]: SEVERE: IOException while loading persisted sessions: java.io.StreamCorruptedException: invalid type code: 00
     Feb 25 21:37:49 linode18 tomcat7[28363]: java.io.StreamCorruptedException: invalid type code: 00
     Feb 25 21:37:49 linode18 tomcat7[28363]:         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1601)
     Feb 25 21:37:49 linode18 tomcat7[28363]:         at java.lang.Throwable.readObject(Throwable.java:914)
     Feb 25 21:37:49 linode18 tomcat7[28363]:         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
     Feb 25 21:37:49 linode18 tomcat7[28363]:         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  • I don’t think that’s related…

  • Also, now the Solr admin UI says “statistics-2015: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher”

  • In the Solr log I see:

    2019-02-25 21:38:14,246 ERROR org.apache.solr.core.CoreContainer @ Error creating core [statistics-2015]: Error opening new searcher
     org.apache.solr.common.SolrException: Error opening new searcher
            at org.apache.solr.core.SolrCore.<init>(SolrCore.java:873)
            at org.apache.solr.core.SolrCore.<init>(SolrCore.java:646)
     ...
     Caused by: org.apache.solr.common.SolrException: Error opening new searcher
            at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1565)
            at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1677)
            at org.apache.solr.core.SolrCore.<init>(SolrCore.java:845)
            ... 31 more
     Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2015/data/index/write.lock
            at org.apache.lucene.store.Lock.obtain(Lock.java:89)
            at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:753)
            at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:77)
            at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:64)
            at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:279)
            at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:111)
            at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1528)
            ... 33 more
     2019-02-25 21:38:14,250 ERROR org.apache.solr.core.SolrCore @ org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2015': Unable to create core [statistics-2015] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2015/data/index/write.lock
  • I tried to shutdown Tomcat and remove the locks:

    # systemctl stop tomcat7
     # find /home/cgspace.cgiar.org/solr -iname "*.lock" -delete
     # systemctl start tomcat7
  • … but the problem still occurs

  • I can see that there are still hits being recorded for items (in the Solr admin UI as well as my statistics API), so the main stats core is working at least!

  • On a hunch I tried adding ulimit -v unlimited to the Tomcat catalina.sh and now Solr starts up with no core errors and I actually have statistics for January and February on some communities, but not others

  • I wonder if the address space limits that I added via LimitAS=infinity in the systemd service are somehow not working?

  • I did some tests with calling a shell script from systemd on DSpace Test (linode19) and the LimitAS setting does work, and the infinity setting in systemd does get translated to “unlimited” on the service

  • I thought it might be open file limit, but it seems we’re nowhere near the current limit of 16384:

    # lsof -u dspace | wc -l
     3016
  • For what it’s worth I see the same errors about solr_update_time_stamp on DSpace Test (linode19)

  • Update DSpace Test to Tomcat 7.0.93

  • Something seems to have happened (some Atmire scheduled task, perhaps the CUA one at 7AM?) on CGSpace because I checked a few communities and collections on CGSpace and there are now statistics for January and February


    CGSpace statlets working again



    2019-02-26


    Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:1021 by user 3049

    2019-02-27

  • He asked me to upload the files for him via the command line, but the file he referenced (Thumbnails_feb_2019.zip) doesn't exist
  • I noticed that the command line batch import functionality is a bit weird when using zip files because you have to specify the directory where the zip file is located as well as the zip file's name:
    $ ~/dspace/bin/dspace import -a -e aorth@stfu.com -m mapfile -s /home/aorth/Downloads/2019-02-27-test/ -z SimpleArchiveFormat.zip
  • Why don’t they just derive the directory from the path to the zip file?

  • Working on Udana's Restoring Degraded Landscapes (RDL) WLE records that we originally started in 2018-11, fixing many of the same problems that I fixed back then


    2019-02-28

    $ dspace import -a -e swebshet@stfu.org -s /home/swebshet/Thumbnails_feb_2019 -m 2019-02-28-CTA-Thumbnails.map
  • Mails from CGSpace stopped working; it looks like ICT changed the password again or we got locked out (sigh)

  • Now I’m getting this message when trying to use DSpace’s test-email script:

    $ dspace test-email
     
     About to send test email:
     - To: stfu@google.com
     - Subject: DSpace test email
     - Server: smtp.office365.com
     
     Error sending email:
     - Error: javax.mail.AuthenticationFailedException
     
     Please see the DSpace documentation for assistance.
  • I’ve tried to log in with the last two passwords that ICT reset it to earlier this month, but they are not working

  • I sent a mail to ILRI ICT to check if we’re locked out or reset the password again



    2019-03-01


  • I think I will need to ask Udana to re-copy and paste the abstracts with more care using Google Docs


    2019-03-03

    $ mkdir 2019-03-03-IITA-Feb14
     $ dspace export -i 10568/108684 -t COLLECTION -m -n 0 -d 2019-03-03-IITA-Feb14
  • As I was inspecting the archive I noticed that there were some problems with the bitstreams:

    - + - -

    2019-03-06

    - +
  • +
  • After adding the missing bitstreams and descriptions manually I tested them again locally, then imported them to a temporary collection on CGSpace:
    $ dspace import -a -c 10568/99832 -e aorth@stfu.com -m 2019-03-03-IITA-Feb14.map -s /tmp/2019-03-03-IITA-Feb14

    2019-03-06

    $ dspace test-email
     
     About to send test email:
     - To: blah@stfu.com
     - Subject: DSpace test email
     - Server: smtp.office365.com
     
     Error sending email:
     - Error: javax.mail.AuthenticationFailedException

  • I will send a follow-up to ICT to ask them to reset the password


    2019-03-07


    $ csvcut -c name 2019-02-22-subjects.csv > dspace/config/controlled-vocabularies/dc-contributor-author.xml
     $ # apply formatting in XML file
     $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/dc-subject.xml
  • I tested the AGROVOC controlled vocabulary locally and will deploy it on DSpace Test soon so people can see it

  • Atmire noticed my message about the “solr_update_time_stamp” error on the dspace-tech mailing list and created an issue on their tracker to discuss it with me


    2019-03-08


    # journalctl -u tomcat7 | grep -c 'Multiple update components target the same field:solr_update_time_stamp'
     1076
  • I restarted Tomcat and it’s OK now…

  • Skype meeting with Peter and Abenet and Sisay



    2019-03-09

    dspace=# SELECT text_value FROM metadatavalue WHERE resource_type_id in (3,4) AND (text_value LIKE '%http://feedburner.%' OR text_value LIKE '%http://feeds.feedburner.%');
  • I can replace these globally using the following SQL:

    dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'http://feedburner.','https//feedburner.', 'g') WHERE resource_type_id in (3,4) AND text_value LIKE '%http://feedburner.%';
     UPDATE 43
     dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'http://feeds.feedburner.','https//feeds.feedburner.', 'g') WHERE resource_type_id in (3,4) AND text_value LIKE '%http://feeds.feedburner.%';
     UPDATE 44
  • I ran the corrections on CGSpace and DSpace Test


    2019-03-10



    $ csvcut -c 'id,cg.subject.iita,cg.subject.iita[],cg.subject.iita[en],cg.subject.iita[en_US]' ~/Downloads/10568-68616.csv > /tmp/iita.csv
    if(isBlank(value), 'PLANT PRODUCTION & HEALTH', value + '||PLANT PRODUCTION & HEALTH')

    2019-03-11


    2019-03-12



    2019-03-14


  • This is a bit ugly, but it works (using the DSpace 5 SQL helper function to resolve ID to handle):
    for id in $(psql -U postgres -d dspacetest -h localhost -c "SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228 AND text_value LIKE '%SWAZILAND%'" | grep -oE '[0-9]{3,}'); do
     
    -echo "Getting handle for id: ${id}"
    +    echo "Getting handle for id: ${id}"
     
    -handle=$(psql -U postgres -d dspacetest -h localhost -c "SELECT ds5_item2itemhandle($id)" | grep -oE '[0-9]{5}/[0-9]+')
    +    handle=$(psql -U postgres -d dspacetest -h localhost -c "SELECT ds5_item2itemhandle($id)" | grep -oE '[0-9]{5}/[0-9]+')
     
    -~/dspace/bin/dspace metadata-export -f /tmp/${id}.csv -i $handle
    +    ~/dspace/bin/dspace metadata-export -f /tmp/${id}.csv -i $handle
     
     done
  • Then I couldn’t figure out a clever way to join all the CSVs, so I just grepped them to find the IDs with dates from 2018 and 2019 and there are apparently only three:

    $ grep -oE '201[89]' /tmp/*.csv | sort -u
     /tmp/94834.csv:2018
     /tmp/95615.csv:2018
     /tmp/96747.csv:2018
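
  • A possible alternative to grepping would be a small Python loop over the exported CSVs (a sketch; like the grep it just looks for 2018 or 2019 anywhere in the metadata rather than parsing the date column):

    # Sketch: flag exported CSVs that contain a 2018 or 2019 value anywhere.
    import csv
    import glob
    import re

    pattern = re.compile(r'201[89]')

    for path in glob.glob('/tmp/*.csv'):
        with open(path, newline='') as f:
            if any(pattern.search(value) for row in csv.reader(f) for value in row):
                print(path)
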
  • And looking at those items more closely, only one of them has an issue date after 2018-04, so I will only update that one (as the country's name only changed in 2018-04)

  • Run all system updates and reboot linode20

  • Follow up with Felix from Earlham to see if he’s done testing DSpace Test with COPO so I can re-sync the server from CGSpace


    2019-03-15


    2019-03-15 14:09:32,685 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
     java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@55ba10b5 is closed.
             at org.apache.tomcat.dbcp.dbcp.DelegatingConnection.checkOpen(DelegatingConnection.java:398)
             at org.apache.tomcat.dbcp.dbcp.DelegatingConnection.prepareStatement(DelegatingConnection.java:279)
             at org.apache.tomcat.dbcp.dbcp.PoolingDataSource$PoolGuardConnectionWrapper.prepareStatement(PoolingDataSource.java:313)
             at org.dspace.storage.rdbms.DatabaseManager.queryTable(DatabaseManager.java:220)
             at org.dspace.authorize.AuthorizeManager.getPolicies(AuthorizeManager.java:612)
             at org.dspace.content.crosswalk.METSRightsCrosswalk.disseminateElement(METSRightsCrosswalk.java:154)
             at org.dspace.content.crosswalk.METSRightsCrosswalk.disseminateElement(METSRightsCrosswalk.java:300)

  • Interestingly, I see a pattern of these errors increasing, with single and double digit numbers over the past month, but spikes of over 1,000 today, yesterday, and on 2019-03-08, which was exactly the first time we saw this blank page error recently

    $ grep -I 'SQL QueryTable Error' dspace.log.2019-0* | awk -F: '{print $1}' | sort | uniq -c | tail -n 25
          5 dspace.log.2019-02-27
         11 dspace.log.2019-02-28
         29 dspace.log.2019-03-01
         24 dspace.log.2019-03-02
         41 dspace.log.2019-03-03
         11 dspace.log.2019-03-04
          9 dspace.log.2019-03-05
         15 dspace.log.2019-03-06
          7 dspace.log.2019-03-07
          9 dspace.log.2019-03-08
         22 dspace.log.2019-03-09
         23 dspace.log.2019-03-10
         18 dspace.log.2019-03-11
         13 dspace.log.2019-03-12
         10 dspace.log.2019-03-13
         25 dspace.log.2019-03-14
         12 dspace.log.2019-03-15
         67 dspace.log.2019-03-16
         72 dspace.log.2019-03-17
          8 dspace.log.2019-03-18
         15 dspace.log.2019-03-19
         21 dspace.log.2019-03-20
         29 dspace.log.2019-03-21
         41 dspace.log.2019-03-22
       4807 dspace.log.2019-03-23

  • - -
  • (Update on 2019-03-23 to use correct grep query)

  • - -
  • There are not too many connections currently in PostgreSQL:

    - + 5 dspace.log.2019-02-27 + 11 dspace.log.2019-02-28 + 29 dspace.log.2019-03-01 + 24 dspace.log.2019-03-02 + 41 dspace.log.2019-03-03 + 11 dspace.log.2019-03-04 + 9 dspace.log.2019-03-05 + 15 dspace.log.2019-03-06 + 7 dspace.log.2019-03-07 + 9 dspace.log.2019-03-08 + 22 dspace.log.2019-03-09 + 23 dspace.log.2019-03-10 + 18 dspace.log.2019-03-11 + 13 dspace.log.2019-03-12 + 10 dspace.log.2019-03-13 + 25 dspace.log.2019-03-14 + 12 dspace.log.2019-03-15 + 67 dspace.log.2019-03-16 + 72 dspace.log.2019-03-17 + 8 dspace.log.2019-03-18 + 15 dspace.log.2019-03-19 + 21 dspace.log.2019-03-20 + 29 dspace.log.2019-03-21 + 41 dspace.log.2019-03-22 + 4807 dspace.log.2019-03-23 +
$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
  6 dspaceApi
 10 dspaceCli
 15 dspaceWeb

  • I didn’t see anything interesting in the PostgreSQL logs, though this stack trace from the Tomcat logs (in the systemd journal) from earlier today might be related?
SEVERE: Servlet.service() for servlet [spring] in context with path [] threw exception [org.springframework.web.util.NestedServletException: Request processing failed; nested exception is java.util.EmptyStackException] with root cause
java.util.EmptyStackException
    at java.util.Stack.peek(Stack.java:102)
    at java.util.Stack.pop(Stack.java:84)
    at org.apache.cocoon.callstack.CallStack.leave(CallStack.java:54)
    at org.apache.cocoon.servletservice.CallStackHelper.leaveServlet(CallStackHelper.java:85)
    at org.apache.cocoon.servletservice.ServletServiceContext$PathDispatcher.forward(ServletServiceContext.java:484)
    at org.apache.cocoon.servletservice.ServletServiceContext$PathDispatcher.forward(ServletServiceContext.java:443)
    at org.apache.cocoon.servletservice.spring.ServletFactoryBean$ServiceInterceptor.invoke(ServletFactoryBean.java:264)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
    at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:202)
    at com.sun.proxy.$Proxy90.service(Unknown Source)
    at org.dspace.springmvc.CocoonView.render(CocoonView.java:113)
    at org.springframework.web.servlet.DispatcherServlet.render(DispatcherServlet.java:1180)
    at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:950)
    at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:852)
    at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:882)
    at org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:778)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:624)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:731)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:303)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
    at org.dspace.rdf.negotiation.NegotiationFilter.doFilter(NegotiationFilter.java:59)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
    at org.dspace.utils.servlet.DSpaceWebappServletFilter.doFilter(DSpaceWebappServletFilter.java:78)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:219)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:110)
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:494)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:169)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:104)
    at org.apache.catalina.valves.CrawlerSessionManagerValve.invoke(CrawlerSessionManagerValve.java:234)
    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:1025)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:445)
    at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1137)
    at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:637)
    at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:317)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    at java.lang.Thread.run(Thread.java:748)

  • For now I will just restart Tomcat…

2019-03-17

  • Create and merge pull request for the AGROVOC controlled list (#415)
  • Re-sync DSpace Test with a fresh database snapshot and assetstore from CGSpace
  • I’m not entirely sure if it’s related, but I tried to delete the old migrations and then force running the ignored ones, like when we upgraded to DSpace 5.8 in 2018-06, and after restarting Tomcat I could see the item displays again
  • I copied the 2019 Solr statistics core from CGSpace to DSpace Test and it works (and is only 5.5GB currently), so now we have some useful stats on DSpace Test for the CUA module and the dspace-statistics-api
  • I ran DSpace’s cleanup task on CGSpace (linode18) and there were errors:
$ dspace cleanup -v
Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
  Detail: Key (bitstream_id)=(164496) is still referenced from table "bundle".

  • The solution is, as always, to clear the offending primary_bitstream_id and run the cleanup again:

# su - postgres
$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (164496);'
UPDATE 1

2019-03-18

  • Dump top 1500 subjects from CGSpace to try one more time to generate a list of invalid terms using my agrovoc-lookup.py script:
    dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 57 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-03-18-top-1500-subject.csv WITH CSV HEADER;
     COPY 1500
     dspace=# \q
$ sort -u 2019-03-18-top-1500-subject.csv > /tmp/1500-subjects-sorted.txt
     $ comm -13 /tmp/subjects-matched-sorted.txt /tmp/1500-subjects-sorted.txt > 2019-03-18-subjects-unmatched.txt
     $ wc -l 2019-03-18-subjects-unmatched.txt
     182 2019-03-18-subjects-unmatched.txt

  • So the new total of matched terms with the updated regex is 1317 and unmatched is 183 (previous number of matched terms was 1187)
  • Create and merge a pull request to update the controlled vocabulary for AGROVOC terms (#416)
  • We are getting the blank page issue on CGSpace again today and I see a large number of the “SQL QueryTable Error” in the DSpace log again (last time was 2019-03-15):
    $ grep -c 'SQL QueryTable Error' dspace.log.2019-03-1[5678]
     dspace.log.2019-03-15:929
     dspace.log.2019-03-16:67
     dspace.log.2019-03-17:72
     dspace.log.2019-03-18:1038

  • Though WTF, this grep seems to be giving weirdly inaccurate results, and the real number of errors is much lower if I exclude the “binary file matches” result with -I:
    $ grep -I 'SQL QueryTable Error' dspace.log.2019-03-18 | wc -l
     8
     $ grep -I 'SQL QueryTable Error' dspace.log.2019-03-{08,14,15,16,17,18} | awk -F: '{print $1}' | sort | uniq -c
  9 dspace.log.2019-03-08
 25 dspace.log.2019-03-14
 12 dspace.log.2019-03-15
 67 dspace.log.2019-03-16
 72 dspace.log.2019-03-17
  8 dspace.log.2019-03-18

  • It seems to be something with grep doing binary matching on some log files for some reason, so I guess I need to always use -I to say binary files don’t match
  • Anyways, the full error in DSpace’s log is:
    2019-03-18 12:26:23,331 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error - 
     java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@75eaa668 is closed.
    at org.apache.tomcat.dbcp.dbcp.DelegatingConnection.checkOpen(DelegatingConnection.java:398)
    at org.apache.tomcat.dbcp.dbcp.DelegatingConnection.prepareStatement(DelegatingConnection.java:279)
    at org.apache.tomcat.dbcp.dbcp.PoolingDataSource$PoolGuardConnectionWrapper.prepareStatement(PoolingDataSource.java:313)
    at org.dspace.storage.rdbms.DatabaseManager.queryTable(DatabaseManager.java:220)

  • There is a low number of connections to PostgreSQL currently:
    $ psql -c 'select * from pg_stat_activity' | wc -l
     33
     $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
  6 dspaceApi
  7 dspaceCli
 15 dspaceWeb

  • I looked in the PostgreSQL logs, but all I see are a bunch of these errors going back two months to January:

2019-01-13 06:25:13.062 CET [9157] postgres@template1 ERROR:  column "waiting" does not exist at character 217

  • This is unrelated and apparently due to Munin checking a column that was changed in PostgreSQL 9.6
  • I suspect that this issue with the blank pages might not be PostgreSQL after all, perhaps it’s a Cocoon thing?
  • Looking in the cocoon logs I see a large number of warnings about “Can not load requested doc” around 11AM and 12PM:
    $ grep 'Can not load requested doc' cocoon.log.2019-03-18 | grep -oE '2019-03-18 [0-9]{2}:' | sort | uniq -c
  2 2019-03-18 00:
  6 2019-03-18 02:
  3 2019-03-18 04:
  1 2019-03-18 05:
  1 2019-03-18 07:
  2 2019-03-18 08:
  4 2019-03-18 09:
  5 2019-03-18 10:
863 2019-03-18 11:
203 2019-03-18 12:
 14 2019-03-18 13:
  1 2019-03-18 14:

  • And a few days ago, on 2019-03-15, the last time this happened, it was in the afternoon, and the same pattern occurs then around 1–2PM:
    $ xzgrep 'Can not load requested doc' cocoon.log.2019-03-15.xz | grep -oE '2019-03-15 [0-9]{2}:' | sort | uniq -c
  4 2019-03-15 01:
  3 2019-03-15 02:
  1 2019-03-15 03:
 13 2019-03-15 04:
  1 2019-03-15 05:
  2 2019-03-15 06:
  3 2019-03-15 07:
 27 2019-03-15 09:
  9 2019-03-15 10:
  3 2019-03-15 11:
  2 2019-03-15 12:
531 2019-03-15 13:
274 2019-03-15 14:
  4 2019-03-15 15:
 75 2019-03-15 16:
  5 2019-03-15 17:
  5 2019-03-15 18:
  6 2019-03-15 19:
  2 2019-03-15 20:
  4 2019-03-15 21:
  3 2019-03-15 22:
  1 2019-03-15 23:

  • And again on 2019-03-08, surprise surprise, it happened in the morning:
    $ xzgrep 'Can not load requested doc' cocoon.log.2019-03-08.xz | grep -oE '2019-03-08 [0-9]{2}:' | sort | uniq -c
 11 2019-03-08 01:
  3 2019-03-08 02:
  1 2019-03-08 03:
  2 2019-03-08 04:
  1 2019-03-08 05:
  1 2019-03-08 06:
  1 2019-03-08 08:
425 2019-03-08 09:
432 2019-03-08 10:
717 2019-03-08 11:
 59 2019-03-08 12:

  • I’m not sure if it’s Cocoon or if that’s just a symptom of something else

2019-03-19

# systemctl stop tomcat7
# find /home/cgspace.cgiar.org/solr/ -iname "*.lock" -delete
# systemctl start tomcat7

  • After restarting I confirmed that all Solr statistics cores were loaded successfully…
  • Another avenue might be to look at point releases in Solr 4.10.x, as we’re running 4.10.2 and they released 4.10.3 and 4.10.4 back in 2014 or 2015

2019-03-20

  • I sent a mail to the dspace-tech mailing list to ask about Solr issues
  • Testing Solr 4.10.4 on DSpace 5.8:

2019-03-21

    $ grep 'Can not load requested doc' cocoon.log.2019-03-20 | grep -oE '2019-03-20 [0-9]{2}:' | sort | uniq -c
      3 2019-03-20 00:
     12 2019-03-20 02:
$ grep 'Can not load requested doc' cocoon.log.2019-03-21 | grep -oE '2019-03-21 [0-9]{2}:' | sort | uniq -c
      4 2019-03-21 00:
      1 2019-03-21 02:
      4 2019-03-21 03:
      1 2019-03-21 05:
      4 2019-03-21 06:
     11 2019-03-21 07:
     14 2019-03-21 08:
      3 2019-03-21 09:
      4 2019-03-21 10:
      5 2019-03-21 11:
      4 2019-03-21 12:
      3 2019-03-21 13:
      6 2019-03-21 14:
      2 2019-03-21 15:
      3 2019-03-21 16:
      3 2019-03-21 18:
      1 2019-03-21 19:
      6 2019-03-21 20:

  • To investigate the Solr lock issue I added a find command to the Tomcat 7 service with ExecStartPre and ExecStopPost, and noticed that the lock files are always there… (a sketch of such an override follows this list)
  • In other news, I notice that systemd always thinks that Tomcat has failed when it stops, because the JVM exits with code 143, which is apparently normal when a process gracefully receives a SIGTERM (128 + 15 == 143)
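  • A minimal sketch of the kind of systemd override I mean (the unit name and paths are assumptions based on the notes above, not our exact configuration):

# /etc/systemd/system/tomcat7.service.d/override.conf
[Service]
# clean up stale Solr lock files around Tomcat starts and stops
ExecStartPre=/usr/bin/find /home/cgspace.cgiar.org/solr -iname "*.lock" -delete
ExecStopPost=/usr/bin/find /home/cgspace.cgiar.org/solr -iname "*.lock" -delete
# the JVM exits with 143 (128 + SIGTERM) on a graceful stop, so count that as success
SuccessExitStatus=143

  • After editing a drop-in like that you would need to run systemctl daemon-reload for it to take effect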

2019-03-22

2019-03-23

    $ grep 'Can not load requested doc' cocoon.log.2019-03-22 | grep -oE '2019-03-22 [0-9]{2}:' | sort | uniq -c
      2 2019-03-22 00:
     69 2019-03-22 01:
      1 2019-03-22 02:
     13 2019-03-22 03:
      2 2019-03-22 05:
      2 2019-03-22 06:
      8 2019-03-22 07:
      4 2019-03-22 08:
     12 2019-03-22 09:
      7 2019-03-22 10:
      1 2019-03-22 11:
      2 2019-03-22 12:
     14 2019-03-22 13:
      4 2019-03-22 15:
      7 2019-03-22 16:
      7 2019-03-22 17:
      3 2019-03-22 18:
      3 2019-03-22 19:
      7 2019-03-22 20:
    323 2019-03-22 21:
    685 2019-03-22 22:
    357 2019-03-22 23:
$ grep 'Can not load requested doc' cocoon.log.2019-03-23 | grep -oE '2019-03-23 [0-9]{2}:' | sort | uniq -c
    575 2019-03-23 00:
    445 2019-03-23 01:
    518 2019-03-23 02:
    436 2019-03-23 03:
    387 2019-03-23 04:
    593 2019-03-23 05:
    468 2019-03-23 06:
    541 2019-03-23 07:
    440 2019-03-23 08:
    260 2019-03-23 09:

  • I was curious to see if clearing the Cocoon cache in the XMLUI control panel would fix it, but it didn’t
  • Trying to drill down more, I see that the bulk of the errors started around 21:20:
    $ grep 'Can not load requested doc' cocoon.log.2019-03-22 | grep -oE '2019-03-22 21:[0-9]' | sort | uniq -c
  1 2019-03-22 21:0
  1 2019-03-22 21:1
 59 2019-03-22 21:2
 69 2019-03-22 21:3
 89 2019-03-22 21:4
104 2019-03-22 21:5

  • Looking at the Cocoon log around that time I see the full error is:

2019-03-22 21:21:34,378 WARN  org.apache.cocoon.components.xslt.TraxErrorListener  - Can not load requested doc: unknown protocol: cocoon at jndi:/localhost/themes/CIAT/xsl/../../0_CGIAR/xsl//aspect/artifactbrowser/common.xsl:141:90

  • A few milliseconds before that time I see this in the DSpace log:
    2019-03-22 21:21:34,356 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
     org.postgresql.util.PSQLException: This statement has been closed.
    at org.postgresql.jdbc.PgStatement.checkClosed(PgStatement.java:694)
    at org.postgresql.jdbc.PgStatement.getMaxRows(PgStatement.java:501)
    at org.postgresql.jdbc.PgStatement.createResultSet(PgStatement.java:153)
    at org.postgresql.jdbc.PgStatement$StatementResultHandler.handleResultRows(PgStatement.java:204)
    at org.postgresql.core.ResultHandlerDelegate.handleResultRows(ResultHandlerDelegate.java:29)
    at org.postgresql.core.v3.QueryExecutorImpl$1.handleResultRows(QueryExecutorImpl.java:528)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2120)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:308)
    at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:441)
    at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:365)
    at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:143)
    at org.postgresql.jdbc.PgPreparedStatement.executeQuery(PgPreparedStatement.java:106)
    at org.apache.tomcat.dbcp.dbcp.DelegatingPreparedStatement.executeQuery(DelegatingPreparedStatement.java:96)
    at org.apache.tomcat.dbcp.dbcp.DelegatingPreparedStatement.executeQuery(DelegatingPreparedStatement.java:96)
    at org.dspace.storage.rdbms.DatabaseManager.queryTable(DatabaseManager.java:224)
    at org.dspace.storage.rdbms.DatabaseManager.querySingleTable(DatabaseManager.java:375)
    at org.dspace.storage.rdbms.DatabaseManager.findByUnique(DatabaseManager.java:544)
    at org.dspace.storage.rdbms.DatabaseManager.find(DatabaseManager.java:501)
    at org.dspace.eperson.Group.find(Group.java:706)
...
2019-03-22 21:21:34,381 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL query singleTable Error -
org.postgresql.util.PSQLException: This statement has been closed.
    at org.postgresql.jdbc.PgStatement.checkClosed(PgStatement.java:694)
    at org.postgresql.jdbc.PgStatement.getMaxRows(PgStatement.java:501)
    at org.postgresql.jdbc.PgStatement.createResultSet(PgStatement.java:153)
...
2019-03-22 21:21:34,386 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL findByUnique Error -
org.postgresql.util.PSQLException: This statement has been closed.
    at org.postgresql.jdbc.PgStatement.checkClosed(PgStatement.java:694)
    at org.postgresql.jdbc.PgStatement.getMaxRows(PgStatement.java:501)
    at org.postgresql.jdbc.PgStatement.createResultSet(PgStatement.java:153)
...
2019-03-22 21:21:34,395 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL find Error -
org.postgresql.util.PSQLException: This statement has been closed.
    at org.postgresql.jdbc.PgStatement.checkClosed(PgStatement.java:694)
    at org.postgresql.jdbc.PgStatement.getMaxRows(PgStatement.java:501)
    at org.postgresql.jdbc.PgStatement.createResultSet(PgStatement.java:153)
    at org.postgresql.jdbc.PgStatement$StatementResultHandler.handleResultRows(PgStatement.java:204)

  • I restarted Tomcat and now the item displays are working again for now
  • I am wondering if this is an issue with removing abandoned connections in Tomcat’s JDBC pooling?
  • I sent another mail to the dspace-tech mailing list with my observations
  • I spent some time trying to test and debug the Tomcat connection pool’s settings, but for some reason our logs are either messed up or no connections are actually getting abandoned (an example of the relevant Resource attributes is sketched after this list)
  • I compiled this TomcatJdbcConnectionTest and created a bunch of database connections and waited a few minutes, but they never got abandoned until I created more than maxActive (75), after which almost all were purged at once
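  • For context, abandoned-connection handling is configured on the JNDI Resource for the database in Tomcat; the snippet below is only an illustrative sketch with made-up values, not our production settings. Note also that the org.apache.tomcat.dbcp.dbcp.* classes in the stack traces above come from the default Commons DBCP pool, while the Tomcat JDBC pool is only used if the Resource specifies factory="org.apache.tomcat.jdbc.pool.DataSourceFactory":

<!-- hypothetical example, not the actual CGSpace configuration -->
<Resource name="jdbc/dspace" auth="Container" type="javax.sql.DataSource"
          driverClassName="org.postgresql.Driver"
          url="jdbc:postgresql://localhost:5432/dspace"
          username="dspace" password="fuuu"
          factory="org.apache.tomcat.jdbc.pool.DataSourceFactory"
          maxActive="75" maxIdle="20"
          validationQuery="SELECT 1" testOnBorrow="true"
          removeAbandoned="true" removeAbandonedTimeout="60"
          logAbandoned="true" />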

2019-03-24

$ jconsole -J-DsocksProxyHost=localhost -J-DsocksProxyPort=3000 service:jmx:rmi:///jndi/rmi://localhost:5400/jmxrmi -J-DsocksNonProxyHosts=
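  • For reference, the SOCKS proxy on localhost:3000 that jconsole uses above can be provided by an SSH dynamic tunnel to the server (the hostname here is a placeholder):

$ ssh -D 3000 user@cgspace-server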

  • I need to remember to check the active connections next time we have issues with blank item pages on CGSpace
  • In other news, I’ve been running G1GC on DSpace Test (linode19) since 2018-11-08 without realizing it, which is probably a good thing
  • I deployed the latest 5_x-prod branch on CGSpace (linode18) and added more validation to the JDBC pool in our Tomcat config

2019-03-25

  • I spent one hour looking at the invalid AGROVOC terms from last week
  • Looking at the DBCP status on CGSpace via jconsole, everything looks good, though I wonder why timeBetweenEvictionRunsMillis is -1, because the Tomcat 7.0 JDBC docs say the default is 5000…
  • Also, CGSpace doesn’t have many Cocoon errors yet this morning:
    $ grep 'Can not load requested doc' cocoon.log.2019-03-25 | grep -oE '2019-03-25 [0-9]{2}:' | sort | uniq -c
  4 2019-03-25 00:
  1 2019-03-25 01:

  • Holy shit, I just realized we’ve been using the wrong DBCP pool in Tomcat
  • Uptime Robot reported that CGSpace went down and I see the load is very high
  • The top IPs around that time in the nginx API and web logs were:
    # zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "25/Mar/2019:(18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      9 190.252.43.162
     12 157.55.39.140
     18 157.55.39.54
     21 66.249.66.211
     27 40.77.167.185
     29 138.220.87.165
     30 157.55.39.168
     36 157.55.39.9
     50 52.23.239.229
   2380 45.5.186.2
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "25/Mar/2019:(18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    354 18.195.78.144
    363 190.216.179.100
    386 40.77.167.185
    484 157.55.39.168
    507 157.55.39.9
    536 2a01:4f8:140:3192::2
   1123 66.249.66.211
   1186 93.179.69.74
   1222 35.174.184.209
   1720 2a01:4f8:13b:1296::2

  • The IPs look pretty normal, except we’ve never seen 93.179.69.74 before, and it uses the following user agent:

Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.20 Safari/535.1

  • Surprisingly they are re-using their Tomcat session:
    $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=93.179.69.74' dspace.log.2019-03-25 | sort | uniq | wc -l
     1

  • That’s weird because the total number of sessions today seems low compared to recent days:
    $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-25 | sort -u | wc -l
     5657
     $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-24 | sort -u | wc -l
$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-23 | sort -u | wc -l
     17179
     $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-22 | sort -u | wc -l
     7904

  • PostgreSQL seems to be pretty busy:
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
 11 dspaceApi
 10 dspaceCli
 67 dspaceWeb

  • I restarted Tomcat and deployed the new Tomcat JDBC settings on CGSpace since I had to restart the server anyways
  • According to Uptime Robot the server was up and down a few more times over the next hour, so I restarted Tomcat again

2019-03-26

    # zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "26/Mar/2019:(06|07)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 
      3 35.174.184.209
      3 66.249.66.81
      4 104.198.9.108
      4 154.77.98.122
      4 2.50.152.13
     10 196.188.12.245
     14 66.249.66.80
    414 45.5.184.72
    535 45.5.186.2
   2014 205.186.128.185
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "26/Mar/2019:(06|07)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    157 41.204.190.40
    160 18.194.46.84
    160 54.70.40.11
    168 31.6.77.23
    188 66.249.66.81
    284 3.91.79.74
    405 2a01:4f8:140:3192::2
    471 66.249.66.80
    712 35.174.184.209
    784 2a01:4f8:13b:1296::2

  • The two IPv6 addresses are something called BLEXBot, which seems to check the robots.txt file and then completely ignore it by making thousands of requests to dynamic pages like Browse and Discovery
  • Then 35.174.184.209 is MauiBot, which does the same thing
  • Also 3.91.79.74 does, which appears to be CCBot
  • I will add these three to the “bad bot” rate limiting that I originally used for Baidu (see the config sketch at the end of this day’s notes)
  • Going further, these are the IPs making requests to Discovery and Browse pages so far today:
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "(discover|browse)" | grep -E "26/Mar/2019:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    120 34.207.146.166
    128 3.91.79.74
    132 108.179.57.67
    143 34.228.42.25
    185 216.244.66.198
    430 54.70.40.11
   1033 93.179.69.74
   1206 2a01:4f8:140:3192::2
   2678 2a01:4f8:13b:1296::2
   3790 35.174.184.209

  • 54.70.40.11 is SemanticScholarBot
  • 216.244.66.198 is DotBot
  • 93.179.69.74 is some IP in Ukraine, which I will add to the list of bot IPs in nginx
  • I can only hope that this helps the load go down, because all this traffic is disrupting the service for normal users and well-behaved bots (and interrupting my dinner and breakfast)
  • Looking at the database usage I’m wondering why there are so many connections from the DSpace CLI:
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
  5 dspaceApi
 10 dspaceCli
 13 dspaceWeb

  • Looking closer I see they are all idle… so at least I know the load isn’t coming from some background nightly task or something
  • Make a minor edit to my agrovoc-lookup.py script to match subject terms with parentheses like COCOA (PLANT) (a possible pattern is sketched after the commands below)
  • Test 89 corrections and 79 deletions for AGROVOC subject terms from the ones I cleaned up in the last week
    $ ./fix-metadata-values.py -i /tmp/2019-03-26-AGROVOC-89-corrections.csv -db dspace -u dspace -p 'fuuu' -f dc.subject -m 57 -t correct -d -n
     $ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db dspace -u dspace -p 'fuuu' -m 57 -f dc.subject -d -n
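  • The parentheses handling mentioned above can be as simple as allowing an optional parenthesized qualifier in the term pattern; a quick illustration of the idea (not the actual agrovoc-lookup.py code):

$ printf 'COCOA (PLANT)\nSOIL FERTILITY\ncocoa\n' | grep -E '^[A-Z0-9]+( [A-Z0-9]+)*( \([A-Z0-9 ]+\))?$'
COCOA (PLANT)
SOIL FERTILITY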

  • UptimeRobot says CGSpace is down again, but it seems to just be slow, as the load is over 10.0
  • Looking at the nginx logs I don’t see anything terribly abusive, but SemrushBot has made ~3,000 requests to Discovery and Browse pages today:

# grep SemrushBot /var/log/nginx/access.log | grep -E "26/Mar/2019" | grep -E '(discover|browse)' | wc -l
2931

  • So I’m adding it to the badbot rate limiting in nginx, and actually I kinda feel like just blocking all user agents with “bot” in the name for a few days to see if things calm down… maybe not just yet
  • Otherwise, these are the top users in the web and API logs during the last hour (18–19):
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "26/Mar/2019:(18|19)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 
     54 41.216.228.158
     65 199.47.87.140
     75 157.55.39.238
     77 157.55.39.237
     89 157.55.39.236
    100 18.196.196.108
    128 18.195.78.144
    277 2a01:4f8:13b:1296::2
    291 66.249.66.80
    328 35.174.184.209
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "26/Mar/2019:(18|19)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      2 2409:4066:211:2caf:3c31:3fae:2212:19cc
      2 35.10.204.140
      2 45.251.231.45
      2 95.108.181.88
      2 95.137.190.2
      3 104.198.9.108
      3 107.167.109.88
      6 66.249.66.80
     13 41.89.230.156
   1860 45.5.184.2

  • For the XMLUI I see 18.195.78.144 and 18.196.196.108 requesting only CTA items, and with no user agent
  • They are responsible for almost 1,000 XMLUI sessions today:
    $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=(18.195.78.144|18.196.196.108)' dspace.log.2019-03-26 | sort | uniq | wc -l
     937

  • I will add their IPs to the list of bot IPs in nginx so I can tag them as bots and let Tomcat’s Crawler Session Manager Valve force them to re-use their sessions
  • Another user agent behaving badly in Colombia is “GuzzleHttp/6.3.3 curl/7.47.0 PHP/7.0.30-0ubuntu0.16.04.1”
  • I will add curl to the Tomcat Crawler Session Manager because anyone using curl is most likely making automated read-only requests
  • I will add GuzzleHttp to the nginx badbots rate limiting, because it is making requests to dynamic Discovery pages (both changes are sketched at the end of this day’s notes)
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep 45.5.184.72 | grep -E "26/Mar/2019:" | grep -E '(discover|browse)' | wc -l                                        
     119

  • What’s strange is that I can’t see any of their requests in the DSpace log…
    $ grep -I -c 45.5.184.72 dspace.log.2019-03-26 
     0
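  • A rough sketch of those two changes (the real settings live in our nginx and Tomcat configuration, so the names and values here are assumptions): first, tag bad user agents in nginx and rate-limit them:

map $http_user_agent $ua_is_bot {
    default            "";
    ~*(bot|GuzzleHttp) "bot";
}
limit_req_zone $ua_is_bot zone=badbots:10m rate=1r/s;

  • …and then use limit_req zone=badbots burst=5; in the relevant location blocks; second, tell Tomcat’s Crawler Session Manager Valve to treat bots and curl as crawlers so they share one session per IP:

<Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
       crawlerUserAgents=".*[bB]ot.*|.*GuzzleHttp.*|.*curl.*" />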

2019-03-28

# grep Spore-192-EN-web.pdf /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -n
      1 37.48.65.147
      1 80.113.172.162
      2 108.174.5.117
      2 83.110.14.208
      4 18.196.8.188
     84 18.195.78.144
    644 18.194.46.84
   1144 18.196.196.108

  • None of these 18.x.x.x IPs specify a user agent and they are all on Amazon!
  • Shortly after I started the re-indexing, UptimeRobot began to complain that CGSpace was down, then up, then down, then up…
  • I see the load on the server is about 10.0 again for some reason, though I don’t know WHAT is causing that load

  • Here are the Munin graphs of CPU usage for the last day, week, and year:

CPU day

CPU week

CPU year
  • In other news, I see that it's not even the end of the month yet and we have 3.6 million hits already:
    # zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Mar/2019"
     3654911

  • In other other news, I see that DSpace currently has no statistics for years before 2019, yet when I connect to Solr I see all the cores up (one quick check is sketched below)
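  • One quick way to list the loaded statistics cores is Solr’s CoreAdmin STATUS action; a small sketch (the port and the grep pattern are assumptions based on the URLs used elsewhere in these notes):

$ http 'http://localhost:8081/solr/admin/cores?action=STATUS&wt=json' | grep -o '"statistics-[0-9]*"'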

2019-03-29

  • I restarted Tomcat to see if I could fix the missing pre-2019 statistics (yes, that fixed it)

2019-03-31

linode18 CPU usage after migration

    # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Mar/2019"
     4218841
     
     real    0m26.609s
     user    0m31.657s
     sys     0m2.551s

  • Interestingly, now that the CPU steal is not an issue, the REST API is ten seconds faster than it was in 2018-10:
    $ time http --print h 'https://cgspace.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0'
     ...
     0.33s user 0.07s system 2% cpu 17.167 total
     0.27s user 0.04s system 1% cpu 16.643 total
     0.24s user 0.09s system 1% cpu 17.764 total
     0.25s user 0.06s system 1% cpu 15.947 total
    -
  • - -
  • I did some research on dedicated servers to potentially replace Linode for CGSpace stuff and it seems Hetzner is pretty good

    - + +
  • +
  • Looking at the weird issue with shitloads of downloads on the CTA item again
  • +
  • The item was added on 2019-03-13 and these three IPs have attempted to download the item's bitstream 43,000 times since it was added eighteen days ago:
  • +
    # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.{2..17}.gz | grep 'Spore-192-EN-web.pdf' | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 5
     42 196.43.180.134
    621 185.247.144.227
   8102 18.194.46.84
  14927 18.196.196.108
  20265 18.195.78.144

  • I will send a mail to CTA to ask if they know these IPs
  • I wonder if the Cocoon errors we had earlier this month were inadvertently related to the CPU steal issue… I see very low occurrences of the “Can not load requested doc” error in the Cocoon logs over the past few days
  • Helping Perttu debug some issues with the REST API on DSpace Test
    2019-03-29 09:10:07,311 ERROR org.dspace.rest.Resource @ Could not delete collection(id=1451), AuthorizeException. Message: org.dspace.authorize.AuthorizeException: Authorization denied for action ADMIN on COLLECTION:1451 by user 9492

  • IWMI people emailed to ask why two items with the same DOI don’t have the same Altmetric score:
  • Only the second one has an Altmetric score (208)
  • I tweeted handles for both of them to see if Altmetric will pick it up
  • Looking at the Altmetric embed response for the first one:
    _altmetric.embed_callback({"title":"Distilling the role of ecosystem services in the Sustainable Development Goals","doi":"10.1016/j.ecoser.2017.10.010","tq":["Progress on 12 of 17 #SDGs rely on #ecosystemservices - new paper co-authored by a number of","Distilling the role of ecosystem services in the Sustainable Development Goals - new paper by @SNAPPartnership researchers","How do #ecosystemservices underpin the #SDGs? Our new paper starts counting the ways. Check it out in the link below!","Excellent paper about the contribution of #ecosystemservices to SDGs","So great to work with amazing collaborators"],"altmetric_jid":"521611533cf058827c00000a","issns":["2212-0416"],"journal":"Ecosystem Services","cohorts":{"sci":58,"pub":239,"doc":3,"com":2},"context":{"all":{"count":12732768,"mean":7.8220956572788,"rank":56146,"pct":99,"higher_than":12676701},"journal":{"count":549,"mean":7.7567299270073,"rank":2,"pct":99,"higher_than":547},"similar_age_3m":{"count":386919,"mean":11.573702536454,"rank":3299,"pct":99,"higher_than":383619},"similar_age_journal_3m":{"count":28,"mean":9.5648148148148,"rank":1,"pct":96,"higher_than":27}},"authors":["Sylvia L.R. Wood","Sarah K. Jones","Justin A. Johnson","Kate A. Brauman","Rebecca Chaplin-Kramer","Alexander Fremier","Evan Girvetz","Line J. Gordon","Carrie V. Kappel","Lisa Mandle","Mark Mulligan","Patrick O'Farrell","William K. Smith","Louise Willemen","Wei Zhang","Fabrice A. DeClerck"],"type":"article","handles":["10568/89975","10568/89846"],"handle":"10568/89975","altmetric_id":29816439,"schema":"1.5.4","is_oa":false,"cited_by_posts_count":377,"cited_by_tweeters_count":302,"cited_by_fbwalls_count":1,"cited_by_gplus_count":1,"cited_by_policies_count":2,"cited_by_accounts_count":306,"last_updated":1554039125,"score":208.65,"history":{"1y":54.75,"6m":10.35,"3m":5.5,"1m":5.5,"1w":1.5,"6d":1.5,"5d":1.5,"4d":1.5,"3d":1.5,"2d":1,"1d":1,"at":208.65},"url":"http://dx.doi.org/10.1016/j.ecoser.2017.10.010","added_on":1512153726,"published_on":1517443200,"readers":{"citeulike":0,"mendeley":248,"connotea":0},"readers_count":248,"images":{"small":"https://badges.altmetric.com/?size=64&score=209&types=tttttfdg","medium":"https://badges.altmetric.com/?size=100&score=209&types=tttttfdg","large":"https://badges.altmetric.com/?size=180&score=209&types=tttttfdg"},"details_url":"http://www.altmetric.com/details.php?citation_id=29816439"})

  • The response payload for the second one is the same:
    _altmetric.embed_callback({"title":"Distilling the role of ecosystem services in the Sustainable Development Goals","doi":"10.1016/j.ecoser.2017.10.010","tq":["Progress on 12 of 17 #SDGs rely on #ecosystemservices - new paper co-authored by a number of","Distilling the role of ecosystem services in the Sustainable Development Goals - new paper by @SNAPPartnership researchers","How do #ecosystemservices underpin the #SDGs? Our new paper starts counting the ways. Check it out in the link below!","Excellent paper about the contribution of #ecosystemservices to SDGs","So great to work with amazing collaborators"],"altmetric_jid":"521611533cf058827c00000a","issns":["2212-0416"],"journal":"Ecosystem Services","cohorts":{"sci":58,"pub":239,"doc":3,"com":2},"context":{"all":{"count":12732768,"mean":7.8220956572788,"rank":56146,"pct":99,"higher_than":12676701},"journal":{"count":549,"mean":7.7567299270073,"rank":2,"pct":99,"higher_than":547},"similar_age_3m":{"count":386919,"mean":11.573702536454,"rank":3299,"pct":99,"higher_than":383619},"similar_age_journal_3m":{"count":28,"mean":9.5648148148148,"rank":1,"pct":96,"higher_than":27}},"authors":["Sylvia L.R. Wood","Sarah K. Jones","Justin A. Johnson","Kate A. Brauman","Rebecca Chaplin-Kramer","Alexander Fremier","Evan Girvetz","Line J. Gordon","Carrie V. Kappel","Lisa Mandle","Mark Mulligan","Patrick O'Farrell","William K. Smith","Louise Willemen","Wei Zhang","Fabrice A. DeClerck"],"type":"article","handles":["10568/89975","10568/89846"],"handle":"10568/89975","altmetric_id":29816439,"schema":"1.5.4","is_oa":false,"cited_by_posts_count":377,"cited_by_tweeters_count":302,"cited_by_fbwalls_count":1,"cited_by_gplus_count":1,"cited_by_policies_count":2,"cited_by_accounts_count":306,"last_updated":1554039125,"score":208.65,"history":{"1y":54.75,"6m":10.35,"3m":5.5,"1m":5.5,"1w":1.5,"6d":1.5,"5d":1.5,"4d":1.5,"3d":1.5,"2d":1,"1d":1,"at":208.65},"url":"http://dx.doi.org/10.1016/j.ecoser.2017.10.010","added_on":1512153726,"published_on":1517443200,"readers":{"citeulike":0,"mendeley":248,"connotea":0},"readers_count":248,"images":{"small":"https://badges.altmetric.com/?size=64&score=209&types=tttttfdg","medium":"https://badges.altmetric.com/?size=100&score=209&types=tttttfdg","large":"https://badges.altmetric.com/?size=180&score=209&types=tttttfdg"},"details_url":"http://www.altmetric.com/details.php?citation_id=29816439"})

  • Very interesting to see this in the response:
    "handles":["10568/89975","10568/89846"],
     "handle":"10568/89975"

  • On further inspection I see that the Altmetric explorer pages for each of these Handles are actually doing the right thing
  • So it’s likely the DSpace Altmetric badge code that is deciding not to show the badge

2019-04-01

  • There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
   4432 200

  • In the last two weeks there have been 47,000 downloads of this same exact PDF by these three IP addresses
  • Apply country and region corrections and deletions on DSpace Test and CGSpace:
    $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
     $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
     $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
     $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d

2019-04-02

2019-04-03

    $ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2019-04-03-orcid-ids.txt

  • We currently have 1177 unique ORCID identifiers, and this brings our total to 1237!

  • - -
  • Next I will resolve all their names using my resolve-orcids.py script:

    - -
    $ ./resolve-orcids.py -i /tmp/2019-04-03-orcid-ids.txt -o 2019-04-03-orcid-ids.txt -d
    -
  • - -
  • After that I added the XML formatting, formatted the file with tidy, and sorted the names in vim

  • - -
  • One user’s name has changed so I will update those using my fix-metadata-values.py script:

    - -
    $ ./fix-metadata-values.py -i 2019-04-03-update-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -m 240 -t correct -d
    -
  • - -
  • I created a pull request and merged the changes to the 5_x-prod branch (#417)

  • - -
  • A few days ago I noticed some weird update process for the statistics-2018 Solr core and I see it’s still going:

    - -
    2019-04-03 16:34:02,262 INFO  org.dspace.statistics.SolrLogger @ Updating : 1754500/21701 docs in http://localhost:8081/solr//statistics-2018
    -
  • - -
  • Interestingly, there are 5666 occurences, and they are mostly for the 2018 core:

    - -
    $ grep 'org.dspace.statistics.SolrLogger @ Updating' /home/cgspace.cgiar.org/log/dspace.log.2019-04-03 | awk '{print $11}' | sort | uniq -c
    -  1 
    -  3 http://localhost:8081/solr//statistics-2017
    -5662 http://localhost:8081/solr//statistics-2018
    -
  • - -
  • I will have to keep an eye on it because nothing should be updating 2018 stats in 2019…

  • + - -

    2019-04-05

    - +

2019-04-05

    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
      5 dspaceApi
     10 dspaceCli
    250 dspaceWeb

  • I still see those weird messages about updating the statistics-2018 Solr core:

2019-04-05 21:06:53,770 INFO  org.dspace.statistics.SolrLogger @ Updating : 2444600/21697 docs in http://localhost:8081/solr//statistics-2018

  • Looking at iostat 1 10 I also see some CPU steal has come back, and I can confirm it by looking at the Munin graphs:

CPU usage week

  • The statistics-2017 Solr core failed to come up, with this error:

statistics-2017: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher

  • I restarted it again and all the Solr cores came up properly…

2019-04-06

  • Linode sent an alert that there was high CPU usage this morning on CGSpace (linode18), and these were the top IPs in the webserver access logs around that time:
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "06/Apr/2019:(06|07|08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    222 18.195.78.144
    245 207.46.13.58
    303 207.46.13.194
    328 66.249.79.33
    564 207.46.13.210
    566 66.249.79.62
    575 40.77.167.66
   1803 66.249.79.59
   2834 2a01:4f8:140:3192::2
   9623 45.5.184.72
# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "06/Apr/2019:(06|07|08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
     31 66.249.79.62
     41 207.46.13.210
     42 40.77.167.66
     54 42.113.50.219
    132 66.249.79.59
    785 2001:41d0:d:1990::
   1164 45.5.184.72
   2014 50.116.102.77
   4267 45.5.186.2
   4893 205.186.128.185

  • 45.5.184.72 is in Colombia so it’s probably CIAT, and I see they are indeed trying to crawl the Discover pages on CIAT’s datasets collection:
    GET /handle/10568/72970/discover?filtertype_0=type&filtertype_1=author&filter_relational_operator_1=contains&filter_relational_operator_0=equals&filter_1=&filter_0=Dataset&filtertype=dateIssued&filter_relational_operator=equals&filter=2014
  • Their user agent is the one I added to the badbots list in nginx last week: “GuzzleHttp/6.3.3 curl/7.47.0 PHP/7.0.30-0ubuntu0.16.04.1”
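  • A quick way to confirm the nginx rule still catches them is to replay a request with that user agent and check the response code; just a sketch, assuming the rule answers with HTTP 503 like the blocked requests below:

    # should print 503 if the badbots rule matches this user agent
    $ curl -s -o /dev/null -w '%{http_code}\n' -A 'GuzzleHttp/6.3.3 curl/7.47.0 PHP/7.0.30-0ubuntu0.16.04.1' 'https://cgspace.cgiar.org/handle/10568/72970/discover'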

  • They made 22,000 requests to Discover on this collection today alone (and it’s only 11AM):


    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "06/Apr/2019" | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c 
      22077 /handle/10568/72970/discover

  • Yesterday they made 43,000 requests and we actually blocked most of them:

    # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "05/Apr/2019" | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c 
      43631 /handle/10568/72970/discover
    # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "05/Apr/2019" | grep 45.5.184.72 | grep -E '/handle/[0-9]+/[0-9]+/discover' | awk '{print $9}' | sort | uniq -c 
        142 200
      43489 503

  • I need to find a contact at CIAT to tell them to use the REST API rather than crawling Discover
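  • For the record, the same collection can be harvested much more cheaply via the REST API by paging through its items instead of loading Discover pages; a rough sketch (the handle is their datasets collection from above, the numeric collection ID would come from the first response, and limit/offset are the normal DSpace 5 REST paging parameters):

    # look up the collection by handle to get its internal ID, then page through its items
    $ curl -s 'https://cgspace.cgiar.org/rest/handle/10568/72970?expand=all'
    $ curl -s 'https://cgspace.cgiar.org/rest/collections/<collection-id>/items?limit=100&offset=0&expand=metadata'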

  • Maria from Bioversity recommended that we use the phrase “AGROVOC subject” instead of “Subject” in Listings and Reports


    2019-04-07

    $ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&fq=statistics_type%3Aview&fq=bundleName%3AORIGINAL&fq=dateYearMonth%3A2019-03&rows=0&wt=json&indent=true'
    {
        "response": {
            "docs": [],
            "numFound": 96925,
            "start": 0
        },
        "responseHeader": {
            "QTime": 1,
            "params": {
                "fq": [
                    "statistics_type:view",
                    "bundleName:ORIGINAL",
                    "dateYearMonth:2019-03"
                ],
                "indent": "true",
                "q": "type:0 AND (ip:18.196.196.108 OR ip:18.195.78.144 OR ip:18.195.218.6)",
                "rows": "0",
                "wt": "json"
            },
            "status": 0
        }
    }

  • Strangely I don’t see many hits in 2019-04:

    $ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&fq=statistics_type%3Aview&fq=bundleName%3AORIGINAL&fq=dateYearMonth%3A2019-04&rows=0&wt=json&indent=true'
    {
        "response": {
            "docs": [],
            "numFound": 38,
            "start": 0
        },
        "responseHeader": {
            "QTime": 1,
            "params": {
                "fq": [
                    "statistics_type:view",
                    "bundleName:ORIGINAL",
                    "dateYearMonth:2019-04"
                ],
                "indent": "true",
                "q": "type:0 AND (ip:18.196.196.108 OR ip:18.195.78.144 OR ip:18.195.218.6)",
                "rows": "0",
                "wt": "json"
            },
            "status": 0
        }
    }

  • Making some tests on GET vs HEAD requests on the CTA Spore 192 item on DSpace Test:

    - +
    $ http --print Hh GET https://dspacetest.cgiar.org/bitstream/handle/10568/100289/Spore-192-EN-web.pdf
     GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1
     Accept: */*
     X-Content-Type-Options: nosniff
     X-Frame-Options: SAMEORIGIN
     X-Robots-Tag: none
     X-XSS-Protection: 1; mode=block
    -
  • - -
  • And from the server side, the nginx logs show:

    - +
    78.x.x.x - - [07/Apr/2019:01:38:35 -0700] "GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1" 200 68078 "-" "HTTPie/1.0.2"
     78.x.x.x - - [07/Apr/2019:01:39:01 -0700] "HEAD /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1" 200 0 "-" "HTTPie/1.0.2"
    -
  • - -
  • So definitely the size of the transfer is more efficient with a HEAD, but I need to wait to see if these requests show up in Solr

    - +
    2019-04-07 02:05:30,966 INFO  org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=EF2DB6E4F69926C5555B3492BB0071A8:ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818
     2019-04-07 02:05:39,265 INFO  org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=B6381FC590A5160D84930102D068C7A3:ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818
    -
  • - - -
  • So my inclination is that both HEAD and GET requests are registered as views as far as Solr and DSpace are concerned

    - +
    2019-04-07 02:08:44,186 INFO  org.apache.solr.update.UpdateHandler @ start commit{,optimize=true,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
    -
  • - - -
  • Ugh, even after optimizing there are no Solr results for requests from my IP, and actually I only see 18 results from 2019-04 so far and none of them are statistics_type:view… very weird

  • According to the DSpace 5.x Solr documentation the default commit time is after 15 minutes or 10,000 documents (see solrconfig.xml)
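  • If I don't want to wait for the auto-commit while testing, I can force a commit on the statistics core myself; a quick sketch against my local instance (the core URL is the same one I query above, and commit=true on the update handler is standard Solr):

    # tell Solr to commit pending documents so they become searchable immediately
    $ curl -s 'http://localhost:8080/solr/statistics/update?commit=true'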
  • I looped some GET and HEAD requests to a bitstream on my local instance and after some time I see that they do register as downloads (even though they are internal):
    $ http --print b 'http://localhost:8080/solr/statistics/select?q=type%3A0+AND+time%3A2019-04-07*&fq=statistics_type%3Aview&fq=isInternal%3Atrue&rows=0&wt=json&indent=true'
     {
    -"response": {
    -    "docs": [],
    -    "numFound": 909,
    -    "start": 0
    -},
    -"responseHeader": {
    -    "QTime": 0,
    -    "params": {
    -        "fq": [
    -            "statistics_type:view",
    -            "isInternal:true"
    -        ],
    -        "indent": "true",
    -        "q": "type:0 AND time:2019-04-07*",
    -        "rows": "0",
    -        "wt": "json"
    +    "response": {
    +        "docs": [],
    +        "numFound": 909,
    +        "start": 0
         },
    -    "status": 0
    +    "responseHeader": {
    +        "QTime": 0,
    +        "params": {
    +            "fq": [
    +                "statistics_type:view",
    +                "isInternal:true"
    +            ],
    +            "indent": "true",
    +            "q": "type:0 AND time:2019-04-07*",
    +            "rows": "0",
    +            "wt": "json"
    +        },
    +        "status": 0
    +    }
     }
    -}
    -
    - -
  • I confirmed the same on CGSpace itself after making one HEAD request

  • So I'm pretty sure it's something about DSpace Test using the CGSpace statistics core, and not that I deployed Solr 4.10.4 there last week

  • Now this gets more frustrating: I did the same GET and HEAD tests on a local Ubuntu 16.04 VM with Solr 4.10.2 and 4.10.4 and the statistics are recorded

  • Holy shit, all this is actually because of the GeoIP1 deprecation and a missing GeoLiteCity.dat

  • UptimeRobot said CGSpace went down and up a few times tonight, and my first instinct was to check iostat 1 10 and I saw that CPU steal is around 10–30 percent right now…

  • The load average is super high right now, as I've noticed the last few times UptimeRobot said that CGSpace went down:

    $ cat /proc/loadavg 
     10.70 9.17 8.85 18/633 4198
    -
    - -
  • According to the server logs there is actually not much going on right now:

    - -
    # zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E "07/Apr/2019:(18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -118 18.195.78.144
    -128 207.46.13.219
    -129 167.114.64.100
    -159 207.46.13.129
    -179 207.46.13.33
    -188 2408:8214:7a00:868f:7c1e:e0f3:20c6:c142
    -195 66.249.79.59
    -363 40.77.167.21
    -740 2a01:4f8:140:3192::2
    -4823 45.5.184.72
    -# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "07/Apr/2019:(18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -  3 66.249.79.62
    -  3 66.249.83.196
    -  4 207.46.13.86
    -  5 82.145.222.150
    -  6 2a01:4f9:2b:1263::2
    -  6 41.204.190.40
    -  7 35.174.176.49
    - 10 40.77.167.21
    - 11 194.246.119.6
    - 11 66.249.79.59
    -
  • - -
  • 45.5.184.72 is CIAT, who I already blocked and am waiting to hear from

  • - -
  • 2a01:4f8:140:3192::2 is BLEXbot, which should be handled by the Tomcat Crawler Session Manager Valve

  • - -
  • 2408:8214:7a00:868f:7c1e:e0f3:20c6:c142 is some stupid Chinese bot making malicious POST requests

  • - -
  • There are free database connections in the pool:

    - -
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
    -  5 dspaceApi
    -  7 dspaceCli
    - 23 dspaceWeb
    -
  • - -
  • It seems that the issue with CGSpace being “down” is actually because of CPU steal again!!!

  • - -
  • I opened a ticket with support and asked them to migrate the VM to a less busy host

  • + - -

    2019-04-08

    - +
    # zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E "07/Apr/2019:(18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +    118 18.195.78.144
    +    128 207.46.13.219
    +    129 167.114.64.100
    +    159 207.46.13.129
    +    179 207.46.13.33
    +    188 2408:8214:7a00:868f:7c1e:e0f3:20c6:c142
    +    195 66.249.79.59
    +    363 40.77.167.21
    +    740 2a01:4f8:140:3192::2
    +   4823 45.5.184.72
    +# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "07/Apr/2019:(18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +      3 66.249.79.62
    +      3 66.249.83.196
    +      4 207.46.13.86
    +      5 82.145.222.150
    +      6 2a01:4f9:2b:1263::2
    +      6 41.204.190.40
    +      7 35.174.176.49
    +     10 40.77.167.21
    +     11 194.246.119.6
    +     11 66.249.79.59
    +
    +
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
    +      5 dspaceApi
    +      7 dspaceCli
    +     23 dspaceWeb
    +
    +

    2019-04-08

    $ lein run ~/src/git/DSpace/2019-02-22-affiliations.csv name id
    -
    - - -
  • After matching the values and creating some new matches I had trouble remembering how to copy the reconciled values to a new column

    - +
    if(cell.recon.matched, cell.recon.match.name, value)
    -
  • - - -
  • See the OpenRefine variables documentation for more notes about the recon object

  • - -
  • I also noticed a handful of errors in our current list of affiliations so I corrected them:

    - +
    $ ./fix-metadata-values.py -i 2019-04-08-fix-13-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -d
    -
  • - -
  • We should create a new list of affiliations to update our controlled vocabulary again

  • - -
  • I dumped a list of the top 1500 affiliations:

    - +
    dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 211 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-04-08-top-1500-affiliations.csv WITH CSV HEADER;
     COPY 1500
    -
  • - -
  • Fix a few more messed up affiliations that have return characters in them (use Ctrl-V Ctrl-M to re-create control character):

    - +
    dspace=# UPDATE metadatavalue SET text_value='International Institute for Environment and Development' WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE 'International Institute^M%';
     dspace=# UPDATE metadatavalue SET text_value='Kenya Agriculture and Livestock Research Organization' WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE 'Kenya Agricultural  and Livestock  Research^M%';
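  • A quick way to check for any remaining values with embedded carriage returns, without typing the control character by hand, is to match on chr(13) in PostgreSQL; a sketch against the same affiliation field (metadata_field_id 211):

    $ psql -d dspace -c "SELECT text_value FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE '%' || chr(13) || '%';"
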
  • I noticed a bunch of subjects and affiliations that use stylized apostrophes so I will export those and then batch update them:

    - +
    dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE '%’%') to /tmp/2019-04-08-affiliations-apostrophes.csv WITH CSV HEADER;
     COPY 60
     dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 57 AND text_value LIKE '%’%') to /tmp/2019-04-08-subject-apostrophes.csv WITH CSV HEADER;
     COPY 20
    -
  • - -
  • I cleaned them up in OpenRefine and then applied the fixes on CGSpace and DSpace Test:

    - +
    $ ./fix-metadata-values.py -i /tmp/2019-04-08-fix-60-affiliations-apostrophes.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -d
     $ ./fix-metadata-values.py -i /tmp/2019-04-08-fix-20-subject-apostrophes.csv -db dspace -u dspace -p 'fuuu' -f dc.subject -m 57 -t correct -d
    -
  • - -
  • UptimeRobot said that CGSpace (linode18) went down tonight

    - +
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
    -5 dspaceApi
    -7 dspaceCli
    -250 dspaceWeb
    -
  • - - -
  • On a related note I see connection pool errors in the DSpace log:

    - + 5 dspaceApi + 7 dspaceCli + 250 dspaceWeb +
    2019-04-08 19:01:10,472 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error - 
     org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-319] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
    -
  • - -
  • But still I see 10 to 30% CPU steal in iostat that is also reflected in the Munin graphs:

  • + - -

    CPU usage week

    - +

    CPU usage week

    - -

    2019-04-09

    - +
    # zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E "08/Apr/2019:(17|18|19)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +    124 40.77.167.135
    +    135 95.108.181.88
    +    139 157.55.39.206
    +    190 66.249.79.133
    +    202 45.5.186.2
    +    284 207.46.13.95
    +    359 18.196.196.108
    +    457 157.55.39.164
    +    457 40.77.167.132
    +   3822 45.5.184.72
    +# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "08/Apr/2019:(17|18|19)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +      5 129.0.79.206
    +      5 41.205.240.21
    +      7 207.46.13.95
    +      7 66.249.79.133
    +      7 66.249.79.135
    +      7 95.108.181.88
    +      8 40.77.167.111
    +     19 157.55.39.164
    +     20 40.77.167.132
    +    370 51.254.16.223
    +

    2019-04-09

    # zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "09/Apr/2019:(06|07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    - 18 66.249.79.139
    - 21 157.55.39.160
    - 29 66.249.79.137
    - 38 66.249.79.135
    - 50 34.200.212.137
    - 54 66.249.79.133
    -100 102.128.190.18
    -1166 45.5.184.72
    -4251 45.5.186.2
    -4895 205.186.128.185
    +     18 66.249.79.139
    +     21 157.55.39.160
    +     29 66.249.79.137
    +     38 66.249.79.135
    +     50 34.200.212.137
    +     54 66.249.79.133
    +    100 102.128.190.18
    +   1166 45.5.184.72
    +   4251 45.5.186.2
    +   4895 205.186.128.185
     # zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E "09/Apr/2019:(06|07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -200 144.48.242.108
    -202 207.46.13.185
    -206 18.194.46.84
    -239 66.249.79.139
    -246 18.196.196.108
    -274 31.6.77.23
    -289 66.249.79.137
    -312 157.55.39.160
    -441 66.249.79.135
    -856 66.249.79.133
    -
    - -
  • 45.5.186.2 is at CIAT in Colombia and I see they are mostly making requests to the REST API, but also to XMLUI with the following user agent:

    - + 200 144.48.242.108 + 202 207.46.13.185 + 206 18.194.46.84 + 239 66.249.79.139 + 246 18.196.196.108 + 274 31.6.77.23 + 289 66.249.79.137 + 312 157.55.39.160 + 441 66.249.79.135 + 856 66.249.79.133 +
    Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36
    -
  • - -
  • Database connection usage looks fine:

    - +
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
    -  5 dspaceApi
    -  7 dspaceCli
    - 11 dspaceWeb
    -
  • - -
  • Ironically I do still see some 2 to 10% of CPU steal in iostat 1 10

  • - -
  • Leroy from CIAT contacted me to say he knows the team who is making all those requests to CGSpace

    - + 5 dspaceApi + 7 dspaceCli + 11 dspaceWeb + - -

    2019-04-10

    - +
  • +
  • In other news, Linode staff identified a noisy neighbor sharing our host and migrated it elsewhere last night
  • + +

    2019-04-10

    $ http 'https://api.crossref.org/funders?query=mercator&mailto=me@cgiar.org'
    -
    - -
  • Otherwise, they provide the funder data in CSV and RDF format

  • - -
  • I did a quick test with the recent IITA records against reconcile-csv in OpenRefine and it matched a few, but the ones that didn’t match will need a human to go and do some manual checking and informed decision making…

  • - -
  • If I want to write a script for this I could use the Python habanero library:

    - +
    from habanero import Crossref
     cr = Crossref(mailto="me@cgiar.org")
     x = cr.funders(query = "mercator")
    -
  • - - -

    2019-04-11

    - +

    2019-04-11

    + +
  • I captured a few general corrections and deletions for AGROVOC subjects while looking at IITA's records, so I applied them to DSpace Test and CGSpace:
  • +
    $ ./fix-metadata-values.py -i /tmp/2019-04-11-fix-14-subjects.csv -db dspace -u dspace -p 'fuuu' -f dc.subject -m 57 -t correct -d
     $ ./delete-metadata-values.py -i /tmp/2019-04-11-delete-6-subjects.csv -db dspace -u dspace -p 'fuuu' -m 57 -f dc.subject -d
    -
    - -
  • Answer more questions about DOIs and Altmetric scores from WLE

  • - -
  • Answer more questions about DOIs and Altmetric scores from IWMI

    - + - -

    2019-04-13

    - +
  • + +

    2019-04-13

    - -

    Java GC during Solr indexing with CMS

    - +

    Java GC during Solr indexing with CMS

    +
  • I tried again with the GC tuning settings from the Solr 4.10.4 release:
  • - -

    Java GC during Solr indexing Solr 4.10.4 settings

    - -

    2019-04-14

    - +

    Java GC during Solr indexing Solr 4.10.4 settings

    +

    2019-04-14

    - -

    2019-04-15

    - +
    GC_TUNE="-XX:NewRatio=3 \
    +    -XX:SurvivorRatio=4 \
    +    -XX:TargetSurvivorRatio=90 \
    +    -XX:MaxTenuringThreshold=8 \
    +    -XX:+UseConcMarkSweepGC \
    +    -XX:+UseParNewGC \
    +    -XX:ConcGCThreads=4 -XX:ParallelGCThreads=4 \
    +    -XX:+CMSScavengeBeforeRemark \
    +    -XX:PretenureSizeThreshold=64m \
    +    -XX:+UseCMSInitiatingOccupancyOnly \
    +    -XX:CMSInitiatingOccupancyFraction=50 \
    +    -XX:CMSMaxAbortablePrecleanTime=6000 \
    +    -XX:+CMSParallelRemarkEnabled \
    +    -XX:+ParallelRefProcEnabled"
    +
    +

    2019-04-15

    + +
  • Pretty annoying to see CGSpace (linode18) with 20–50% CPU steal according to iostat 1 10, though I haven't had any Linode alerts in a few days
  • +
  • Abenet sent me a list of ILRI items that don't have CRPs added to them +
  • +
    import json
     import re
     import urllib
     data = json.load(res)
     item_id = data['id']
     
     return item_id
    -
    - - -
  • Luckily none of the items already had CRPs, so I didn’t have to worry about them getting removed

    - + +
  • +
  • I ran a full Discovery indexing on CGSpace because I didn't do it after all the metadata updates last week:
  • +
    $ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
     real    82m45.324s
     user    7m33.446s
     sys     2m13.463s
    -
    - - -

    2019-04-16

    - +

    2019-04-16

    - -

    2019-04-17

    - +

    2019-04-17

    +
  • 4GB heap, CMS GC, 1024 filter cache, 512 query cache, with 28 million documents in two shards - +
  • + +
  • 4GB heap, CMS GC, 2048 filter cache, 512 query cache, with 28 million documents in two shards - +
  • + +
  • 4GB heap, CMS GC, 4096 filter cache, 512 query cache, with 28 million documents in two shards - +
  • + +
  • The biggest takeaway I have is that this workload benefits from a larger filterCache (for Solr fq parameter), but barely uses the queryResultCache (for Solr q parameter) at all -
  • +
  • The number of hits goes up and the time taken decreases when we increase the filterCache, and total JVM heap memory doesn't seem to increase much at all
  • +
  • I guess the queryResultCache size is always 2 because I'm only doing two queries: type:0 and type:2 (downloads and views, respectively)
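  • The cache behaviour itself can be read from Solr's mbeans stats handler, which is how I would double-check the filterCache and queryResultCache numbers; a sketch against the statistics core, using the same localhost URL as the queries above:

    # dump lookups, hits, hitratio and evictions for all caches on the statistics core
    $ curl -s 'http://localhost:8081/solr/statistics/admin/mbeans?stats=true&cat=CACHE&wt=json&indent=true'
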
  • Here is the general pattern of running three sequential indexing runs as seen in VisualVM while monitoring the Tomcat process:
  • - -

    VisualVM Tomcat 4096 filterCache

    - +

    VisualVM Tomcat 4096 filterCache

    +
  • The JVM garbage collection graph is MUCH flatter, and memory usage is much lower (not to mention a drop in GC-related CPU usage)!
  • - -

    VisualVM Tomcat 16384 filterCache

    - +

    VisualVM Tomcat 16384 filterCache

    - -

    CPU usage week

    - -

    2019-04-18

    - +

    CPU usage week

    +

    2019-04-18

    +
  • Deploy Tomcat 7.0.94 on DSpace Test (linode19) -
  • -
  • UptimeRobot says that CGSpace went “down” this afternoon, but I looked at the CPU steal with iostat 1 10 and it’s in the 50s and 60s - +
  • I needed to use the “folded” YAML variable format >- (with the dash so it doesn't add a return at the end)
  • + + +
  • UptimeRobot says that CGSpace went “down” this afternoon, but I looked at the CPU steal with iostat 1 10 and it's in the 50s and 60s
  • - -

    CPU usage week

    - + + +

    CPU usage week

    - -

    2019-04-20

    - + + +

    2019-04-20

    - -

    CPU usage week

    - +

    CPU usage week

    # iperf -s
     ------------------------------------------------------------
     Server listening on TCP port 5001
     TCP window size: 85.0 KByte (default)
     [ ID] Interval       Transfer     Bandwidth
     [  5]  0.0-10.2 sec   172 MBytes   142 Mbits/sec
     [  4]  0.0-10.5 sec   202 MBytes   162 Mbits/sec
    -
    - -
  • Even with the software firewalls disabled the rsync speed was low, so it’s not a rate limiting issue

  • - -
  • I also tried to download a file over HTTPS from CGSpace to DSpace Test, but it was capped at 20KiB/sec
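  • curl can report the average transfer speed directly, which makes it easy to re-test after each change; a sketch re-using the Spore bitstream I was testing with earlier (assuming the same path exists on CGSpace):

    # print the average download speed in bytes per second
    $ curl -o /dev/null -s -w '%{speed_download}\n' https://cgspace.cgiar.org/bitstream/handle/10568/100289/Spore-192-EN-web.pdf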

  • I'm going to try to switch the kernel to the latest upstream (5.0.8) instead of Linode's latest x86_64
  • - -

    2019-04-21

    - + + +

    2019-04-21

    - -

    2019-04-22

    - + + +

    2019-04-22

    + +
  • I want to get rid of this annoying warning that is constantly in our DSpace logs:
  • +
    2019-04-08 19:02:31,770 WARN  org.dspace.xoai.services.impl.xoai.DSpaceRepositoryConfiguration @ { OAI 2.0 :: DSpace } Not able to retrieve the dspace.oai.url property from oai.cfg. Falling back to request address
    -
    - -
  • Apparently it happens once per request, which can be at least 1,500 times per day according to the DSpace logs on CGSpace (linode18):

    - +
    $ grep -c 'Falling back to request address' dspace.log.2019-04-20
     dspace.log.2019-04-20:1515
    -
  • - -
  • I will fix it in dspace/config/modules/oai.cfg
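  • The warning names the exact property it wants, so the fix should just be setting it explicitly; a sketch of what I expect the line in oai.cfg to look like (the URL itself is my assumption based on the site's base URL):

    $ grep dspace.oai.url dspace/config/modules/oai.cfg
    dspace.oai.url = https://cgspace.cgiar.org/oai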

  • - -
  • Linode says that it is likely that the host CGSpace (linode18) is on is showing signs of hardware failure and they recommended that I migrate the VM to a new host

    - + - -

    2019-04-23

    - +
  • + +

    2019-04-23

    - -
    In Azure, with one exception being the A0, there is no overprovisioning… Each physical cores is only supporting one virtual core.
    - + - -

    2019-04-24

    - + + +

    2019-04-24

    +
  • Finally upload the 218 IITA items from March to CGSpace

  • While I was uploading the IITA records I noticed that twenty of the records Sisay uploaded in 2018-09 had double Handles (dc.identifier.uri)

    $ csvcut -c id,dc.identifier.uri,'dc.identifier.uri[]' ~/Downloads/2019-04-24-IITA.csv > /tmp/iita.csv
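  • From that export, csvkit can also show which rows actually have a value in the second Handle column; a sketch matching any non-empty value in the 'dc.identifier.uri[]' column:

    $ csvgrep -c 'dc.identifier.uri[]' -r '.' /tmp/iita.csv | csvcut -c id
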
  • Carlos Tejo from the Land Portal had been emailing me this week to ask about the old REST API that Tsega was building in 2017

    - +
    $ curl -f -H "accept: application/json" -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": "en_US"}'
     curl: (22) The requested URL returned error: 401
    -
  • - - -
  • Note that curl only shows the HTTP 401 error if you use -f (fail), and only then if you don’t include -s

    - +
    dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT' AND text_lang='en_US';
    -count 
    + count 
     -------
    -376
    +   376
     (1 row)
     
     dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT' AND text_lang='';
    -count 
    + count 
     -------
    -149
    +   149
     (1 row)
     
     dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT' AND text_lang IS NULL;
    -count 
    + count 
     -------
    -417
    +   417
     (1 row)
    -
  • - - -
  • I see that the HTTP 401 issue seems to be a bug due to an item that the user doesn’t have permission to access… from the DSpace log:

    - +
    2019-04-24 08:11:51,129 INFO  org.dspace.rest.ItemsResource @ Looking for item with metadata(key=cg.subject.cpwf,value=WATER MANAGEMENT, language=en_US).
     2019-04-24 08:11:51,231 INFO  org.dspace.usage.LoggerUsageEventListener @ anonymous::view_item:handle=10568/72448
     2019-04-24 08:11:51,238 INFO  org.dspace.usage.LoggerUsageEventListener @ anonymous::view_item:handle=10568/72491
     2019-04-24 08:11:51,243 INFO  org.dspace.usage.LoggerUsageEventListener @ anonymous::view_item:handle=10568/75703
     2019-04-24 08:11:51,252 ERROR org.dspace.rest.ItemsResource @ User(anonymous) has not permission to read item!
    -
  • - -
  • Nevertheless, if I request using the null language I get 1020 results, plus 179 for a blank language attribute:

    - +
    $ curl -s -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": null}' | jq length
     1020
     $ curl -s -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": ""}' | jq length
     179
    -
  • - -
  • This is weird because I see 942–1156 items with “WATER MANAGEMENT” (depending on wildcard matching for errors in subject spelling):

    - +
    dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT';
    -count 
    + count 
     -------
    -942
    +   942
     (1 row)
     
     dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value LIKE '%WATER MANAGEMENT%';
    -count 
    + count 
     -------
    -1156
    +  1156
     (1 row)
    -
  • - -
  • I sent a message to the dspace-tech mailing list to ask for help

  • + - -

    2019-04-25

    - +

    2019-04-25

    + +
  • I tested the REST API after logging in with my super admin account and I was able to get results for the problematic query:
  • +
    $ curl -f -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/login" -d '{"email":"example@me.com","password":"fuuuuu"}'
     $ curl -f -H "Content-Type: application/json" -H "rest-dspace-token: b43d41a6-5ac1-455d-b49a-616b8debc25b" -X GET "https://dspacetest.cgiar.org/rest/status"
     $ curl -f -H "rest-dspace-token: b43d41a6-5ac1-455d-b49a-616b8debc25b" -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": "en_US"}'
    -
    - -
  • I created a normal user for Carlos to try as an unprivileged user:

    - +
    $ dspace user --add --givenname Carlos --surname Tejo --email blah@blah.com --password 'ddmmdd'
    -
  • - -
  • But still I get the HTTP 401 and I have no idea which item is causing it

  • - -
  • I enabled more verbose logging in ItemsResource.java and now I can at least see the item ID that causes the failure…

    - +
    dspace=# SELECT * FROM item WHERE item_id=74648;
    -item_id | submitter_id | in_archive | withdrawn |       last_modified        | owning_collection | discoverable
    + item_id | submitter_id | in_archive | withdrawn |       last_modified        | owning_collection | discoverable
     ---------+--------------+------------+-----------+----------------------------+-------------------+--------------
    -74648 |          113 | f          | f         | 2016-03-30 09:00:52.131+00 |                   | t
    +   74648 |          113 | f          | f         | 2016-03-30 09:00:52.131+00 |                   | t
     (1 row)
    -
  • - - -
  • I tried to delete the item in the web interface, and it seems successful, but I can still access the item in the admin interface, and nothing changes in PostgreSQL

  • - -
  • Meet with CodeObia to see progress on AReS version 2

  • - -
  • Marissa Van Epp asked me to add a few new metadata values to their Phase II Project Tags field (cg.identifier.ccafsprojectpii)

    - + - -

    2019-04-26

    - +
  • +
  • Communicate with Carlos Tejo from the Land Portal about the /items/find-by-metadata-value endpoint
  • +
  • Run all system updates on DSpace Test (linode19) and reboot it
  • + +

    2019-04-26

    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-04-26-all-authors.csv with csv header;
     COPY 65752
    -
    +

    2019-04-28

    + - -

    2019-04-28

    - - - -
  • Then I exported the whole repository as CSV, imported it into OpenRefine, removed a few unneeded columns, exported it, zipped it down to 36MB, and emailed a link to Carlos

  • - -
  • In other news, while I was looking through the CSV in OpenRefine I saw lots of weird values in some fields… we should check, for example:

    - + - - +
  • + + diff --git a/docs/2019-05/index.html b/docs/2019-05/index.html index ab94f0f62..cde4a7fe1 100644 --- a/docs/2019-05/index.html +++ b/docs/2019-05/index.html @@ -8,11 +8,9 @@ @@ -34,11 +31,9 @@ But after this I tried to delete the item from the XMLUI and it is still present - + @@ -61,7 +55,7 @@ But after this I tried to delete the item from the XMLUI and it is still present "@type": "BlogPosting", "headline": "May, 2019", "url": "https:\/\/alanorth.github.io\/cgspace-notes\/2019-05\/", - "wordCount": "3215", + "wordCount": "3190", "datePublished": "2019-05-01T07:37:43+03:00", "dateModified": "2019-10-28T13:39:25+02:00", "author": { @@ -132,183 +126,154 @@ But after this I tried to delete the item from the XMLUI and it is still present

    -

    2019-05-01

    - +

    2019-05-01

    + +
  • The item seems to be in a pre-submitted state, so I tried to delete it from there:
  • +
    dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
     DELETE 1
    -
    - -
  • But after this I tried to delete the item from the XMLUI and it is still present…

  • + -
    dspace=# DELETE FROM metadatavalue WHERE resource_id=74648;
     dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
     dspace=# DELETE FROM item WHERE item_id=74648;
    -
    - - -
  • Now the item is (hopefully) really gone and I can continue to troubleshoot the issue with REST API’s /items/find-by-metadata-value endpoint

    - +
    $ curl -f -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": "en_US"}'
     curl: (22) The requested URL returned error: 401 Unauthorized
    -
  • - - -
  • The DSpace log shows the item ID (because I modified the error text):

    - +
    2019-05-01 11:41:11,069 ERROR org.dspace.rest.ItemsResource @ User(anonymous) has not permission to read item(id=77708)!
    -
  • - -
  • If I delete that one I get another, making the list of item IDs so far:

    - + - -

    2019-05-03

    - +
  • +
  • Some are in the workspaceitem table (pre-submission), others are in the workflowitem table (submitted), and others are actually approved, but withdrawn… +
  • +
  • CIP is asking about embedding PDF thumbnail images in their RSS feeds again +
  • +
  • CIP also asked for a way to get an XML file of all their RTB journal articles on CGSpace:

    https://cgspace.cgiar.org/rest/collections/1179/items?limit=812&expand=metadata
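  • The REST API defaults to JSON, but the same request should return XML if it is asked for in the Accept header; a sketch (I have not confirmed this is the exact format CIP wants):

    $ curl -s -H "Accept: application/xml" 'https://cgspace.cgiar.org/rest/collections/1179/items?limit=812&expand=metadata' > rtb-journal-articles.xml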

    2019-05-03

    +
    $ dspace test-email
     
     About to send test email:
    -- To: woohoo@cgiar.org
    -- Subject: DSpace test email
    -- Server: smtp.office365.com
    + - To: woohoo@cgiar.org
    + - Subject: DSpace test email
    + - Server: smtp.office365.com
     
     Error sending email:
    -- Error: javax.mail.AuthenticationFailedException
    + - Error: javax.mail.AuthenticationFailedException
     
     Please see the DSpace documentation for assistance.
    -
    - - -
  • I will ask ILRI ICT to reset the password

    - + - -

    2019-05-05

    - +
  • + +

    2019-05-05

    +
  • Re-deploy CGSpace from 5_x-prod branch
  • Run all system updates on CGSpace (linode18) and reboot it
  • Tag version 1.1.0 of the dspace-statistics-api (with Falcon 2.0.0) -
  • - -

    2019-05-06

    - + + +

    2019-05-06

    statistics-2018: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher 
    -
    - - -
  • As well as this error in the logs:

    - +
    Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
    -
  • - -
  • Strangely enough, I do see the statistics-2018, statistics-2017, etc cores in the Admin UI…

  • - -
  • I restarted Tomcat a few times (and even deleted all the Solr write locks) and at least five times there were issues loading one statistics core, causing the Atmire stats to be incomplete

    - + +
  • +
  • There were a few alerts from UptimeRobot about CGSpace going up and down this morning, along with an alert from Linode about 596% load
  • - -

    CGSpace XMLUI sessions day

    - -

    linode18 firewall connections day

    - -

    linode18 postgres connections day

    - -

    linode18 CPU day

    - + + +

    CGSpace XMLUI sessions day

    +

    linode18 firewall connections day

    +

    linode18 postgres connections day

    +

    linode18 CPU day

    $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-06 | sort | uniq | wc -l
     101108
     $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-05 | sort | uniq | wc -l
    @@ -321,10 +286,9 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-02 | sort | uniq | wc
     7758
     $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-01 | sort | uniq | wc -l
     20528
    -
    - -
  • The number of unique IP addresses from 2 to 6 AM this morning is already several times higher than the average for that time of the morning this past week:

    - +
    # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
     7127
     # zcat --force /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep -E '05/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
    @@ -337,10 +301,9 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-01 | sort | uniq | wc
     1573
     # zcat --force /var/log/nginx/access.log.5.gz /var/log/nginx/access.log.6.gz | grep -E '01/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
     1410
    -
  • - -
  • Just this morning between the hours of 2 and 6 the number of unique sessions was very high compared to previous mornings:

    - +
    $ cat dspace.log.2019-05-06 | grep -E '2019-05-06 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
     83650
     $ cat dspace.log.2019-05-05 | grep -E '2019-05-05 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
    @@ -353,53 +316,45 @@ $ cat dspace.log.2019-05-02 | grep -E '2019-05-02 (02|03|04|05|06):' | grep -o -
     2704
     $ cat dspace.log.2019-05-01 | grep -E '2019-05-01 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
     3699
    -
  • - -
  • Most of the requests were GETs:

    - +
    # cat /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -o -E "(GET|HEAD|POST|PUT)" | sort | uniq -c | sort -n
    -  1 PUT
    - 98 POST
    -2845 HEAD
    -98121 GET
    -
  • - -
  • I’m not exactly sure what happened this morning, but it looks like some legitimate user traffic—perhaps someone launched a new publication and it got a bunch of hits?

  • - -
  • Looking again, I see 84,000 requests to /handle this morning (not including logs for library.cgiar.org because those get HTTP 301 redirect to CGSpace and appear here in access.log):

    - + 1 PUT + 98 POST + 2845 HEAD + 98121 GET +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -c -o -E " /handle/[0-9]+/[0-9]+"
     84350
    -
  • - -
  • But it would be difficult to find a pattern for those requests because they cover 78,000 unique Handles (ie direct browsing of items, collections, or communities) and only 2,492 discover/browse (total, not unique):

    - +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -o -E " /handle/[0-9]+/[0-9]+ HTTP" | sort | uniq | wc -l
     78104
     # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -o -E " /handle/[0-9]+/[0-9]+/(discover|browse)" | wc -l
     2492
    -
  • - -
  • In other news, I see some IP is making several requests per second to the exact same REST API endpoints, for example:

    - +
    # grep /rest/handle/10568/3703?expand=all rest.log | awk '{print $1}' | sort | uniq -c
    -  3 2a01:7e00::f03c:91ff:fe0a:d645
    -113 63.32.242.35
    -
  • - -
  • According to viewdns.info that server belongs to Macaroni Brothers’

    - + 3 2a01:7e00::f03c:91ff:fe0a:d645 + 113 63.32.242.35 + - -

    2019-05-07

    - +
  • + +

    2019-05-07

    # zcat --force /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep -E '06/May/2019' | awk '{print $1}' | sort | uniq | wc -l
     13969
     # zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep -E '05/May/2019' | awk '{print $1}' | sort | uniq | wc -l
    @@ -408,10 +363,9 @@ $ cat dspace.log.2019-05-01 | grep -E '2019-05-01 (02|03|04|05|06):' | grep -o -
     6229
     # zcat --force /var/log/nginx/access.log.4.gz /var/log/nginx/access.log.5.gz | grep -E '03/May/2019' | awk '{print $1}' | sort | uniq | wc -l
     8051
    -
    - -
  • Total number of sessions yesterday was much higher compared to days last week:

    - +
    $ cat dspace.log.2019-05-06 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
     144160
     $ cat dspace.log.2019-05-05 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
    @@ -424,69 +378,57 @@ $ cat dspace.log.2019-05-02 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq |
     26996
     $ cat dspace.log.2019-05-01 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
     61866
    -
  • - -
  • The usage statistics seem to agree that yesterday was crazy:

  • + - -

    Atmire Usage statistics spike 2019-05-06

    - +

    Atmire Usage statistics spike 2019-05-06

    +
  • Add requests cache to resolve-addresses.py script
  • - -

    2019-05-08

    - +

    2019-05-08

    $ dspace test-email
     
     About to send test email:
    -- To: wooooo@cgiar.org
    -- Subject: DSpace test email
    -- Server: smtp.office365.com
    + - To: wooooo@cgiar.org
    + - Subject: DSpace test email
    + - Server: smtp.office365.com
     
     Error sending email:
    -- Error: javax.mail.AuthenticationFailedException
    + - Error: javax.mail.AuthenticationFailedException
     
     Please see the DSpace documentation for assistance.
    -
    - - -
  • I checked the settings and apparently I had updated it incorrectly last week after ICT reset the password

  • - -
  • Help Moayad with certbot-auto for Let’s Encrypt scripts on the new AReS server (linode20)

  • - -
  • Normalize all text_lang values for metadata on CGSpace and DSpace Test (as I had tested last month):

    - +
    UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('ethnob', 'en', '*', 'E.', '');
     UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IS NULL;
     UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('es', 'spa');
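  • Afterwards it is easy to sanity-check that only the expected text_lang values remain; a sketch against the same database:

    $ psql -d dspace -c "SELECT DISTINCT text_lang, count(*) FROM metadatavalue WHERE resource_type_id = 2 GROUP BY text_lang ORDER BY count DESC;"
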
  • Send Francesca Giampieri from Bioversity a CSV export of all their items issued in 2018

    - + - -

    2019-05-10

    - +
  • + +

    2019-05-10

    - -
    The attack that targeted the "Search" functionality of the website, aimed to bypass our mitigation by performing slow but simultaneous searches from 5500 IP addresses.
    - + +
  • All of the IPs from these networks are using generic user agents like this, but MANY more, and they change many times:
  • + +
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2703.0 Safari/537.36"
    +
    + - -

    2019-05-12

    - + + +

    2019-05-12

    $ cat dspace.log.2019-05-11 | grep -E 'ip_addr=(100.26.206.188|100.27.19.233|107.22.98.199|174.129.156.41|18.205.243.110|18.205.245.200|18.207.176.164|18.207.209.186|18.212.126.89|18.212.5.59|18.213.4.150|18.232.120.6|18.234.180.224|18.234.81.13|3.208.23.222|34.201.121.183|34.201.241.214|34.201.39.122|34.203.188.39|34.207.197.154|34.207.232.63|34.207.91.147|34.224.86.47|34.227.205.181|34.228.220.218|34.229.223.120|35.171.160.166|35.175.175.202|3.80.201.39|3.81.120.70|3.81.43.53|3.84.152.19|3.85.113.253|3.85.237.139|3.85.56.100|3.87.23.95|3.87.248.240|3.87.250.3|3.87.62.129|3.88.13.9|3.88.57.237|3.89.71.15|3.90.17.242|3.90.68.247|3.91.44.91|3.92.138.47|3.94.250.180|52.200.78.128|52.201.223.200|52.90.114.186|52.90.48.73|54.145.91.243|54.160.246.228|54.165.66.180|54.166.219.216|54.166.238.172|54.167.89.152|54.174.94.223|54.196.18.211|54.198.234.175|54.208.8.172|54.224.146.147|54.234.169.91|54.235.29.216|54.237.196.147|54.242.68.231|54.82.6.96|54.87.12.181|54.89.217.141|54.89.234.182|54.90.81.216|54.91.104.162)' | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l   
     2206
    -
    - -
  • I added “Unpaywall” to the list of bots in the Tomcat Crawler Session Manager Valve

  • - -
  • Set up nginx to use TLS and proxy pass to NodeJS on the AReS development server (linode20)

  • - -
  • Run all system updates on linode20 and reboot it

  • - -
  • Also, there is 10 to 20% CPU steal on that VM, so I will ask Linode to move it to another host

  • - -
  • Commit changes to the resolve-addresses.py script to add proper CSV output support

  • + - -

    2019-05-14

    - +

    2019-05-14

    - -

    2019-05-15

    - + + +

    2019-05-15

    - -

    2019-05-16

    - + + +

    2019-05-16

    dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 29 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-05-16-investors.csv WITH CSV HEADER;
     COPY 995
    -
    - -
  • Fork the ICARDA AReS v1 repository to ILRI’s GitHub and give access to CodeObia guys

    - + - -

    2019-05-17

    - +
  • + +

    2019-05-17

    $ ./fix-metadata-values.py -i /tmp/2019-05-16-fix-306-Investors.csv -db dspace-u dspace-p 'fuuu' -f dc.description.sponsorship -m 29 -t correct -d
     $ ./delete-metadata-values.py -i /tmp/2019-05-16-delete-297-Investors.csv -db dspace -u dspace -p 'fuuu' -m 29 -f dc.description.sponsorship -d
    -
    - -
  • Then I started a full Discovery re-indexing:

    - +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
     $ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
    -
  • - -
  • I was going to make a new controlled vocabulary of the top 100 terms after these corrections, but I noticed a bunch of duplicates and variations when I sorted them alphabetically

  • - -
  • Instead, I exported a new list and asked Peter to look at it again

  • - -
  • Apply Peter’s new corrections on DSpace Test and CGSpace:

    - +
    $ ./fix-metadata-values.py -i /tmp/2019-05-17-fix-25-Investors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -m 29 -t correct -d
     $ ./delete-metadata-values.py -i /tmp/2019-05-17-delete-14-Investors.csv -db dspace -u dspace -p 'fuuu' -m 29 -f dc.description.sponsorship -d
    -
  • - -
  • Then I re-exported the sponsors and took the top 100 to update the existing controlled vocabulary (#423)

    - + - -

    2019-05-19

    - +
  • + +

    2019-05-19

    - -

    2019-05-24

    - +

    2019-05-24

    - -

    2019-05-25

    - +

    2019-05-25

    - -

    2019-05-27

    - + +
  • Generate Simple Archive Format bundle with SAFBuilder and import into the AfricaRice Articles in Journals collection on CGSpace:
  • + +
    $ dspace import -a -e me@cgiar.org -m 2019-05-25-AfricaRice.map -s /tmp/SimpleArchiveFormat
    +

    2019-05-27

    $ ./fix-metadata-values.py -i /tmp/2019-05-27-fix-2472-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t corrections -d
    -
    - - -
  • Then start a full Discovery re-indexing on each server:

    - +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"                                   
     $ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
    -
  • - -
  • Export new list of all authors from CGSpace database to send to Peter:

    - +
    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-05-27-all-authors.csv with csv header;
     COPY 64871
    -
  • - -
  • Run all system updates on DSpace Test (linode19) and reboot it

  • - -
  • Paola from CIAT asked for a way to generate a report of the top keywords for each year of their articles and journals

    - + - -

    2019-05-29

    - +
  • + +

    2019-05-29

    - -

    2019-05-30

    - + + +

    2019-05-30

    - - +
    2019-05-30 07:19:35,166 INFO  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=A5E0C836AF8F3ABB769FE47107AE1CFF:ip_addr=185.71.4.34:failed_login:no DN found for user sa.saini@cgiar.org
    +
    +
    $ dspace user -a -m blah@blah.com -g Sakshi -s Saini -p 'sknflksnfksnfdls'
    +
    diff --git a/docs/2019-06/index.html b/docs/2019-06/index.html index d9646b6e3..1f1c336ef 100644 --- a/docs/2019-06/index.html +++ b/docs/2019-06/index.html @@ -8,14 +8,11 @@ @@ -27,17 +24,14 @@ Skype with Marie-Angélique and Abenet about CG Core v2 - + @@ -118,22 +112,17 @@ Skype with Marie-Angélique and Abenet about CG Core v2

    -

    2019-06-02

    - +

    2019-06-02

    - -

    2019-06-03

    - +

    2019-06-03

    - +
  • Marie agreed that we need to adopt some controlled lists for our values, and pointed out that the MARLO team maintains a list of CRPs and Centers at CLARISA -
  • - -

    2019-06-04

    - + + +

    2019-06-04

    +
  • Add Arabic language to input-forms.xml (#427), as Bioversity is adding some Arabic items and noticed it missing
  • - -

    2019-06-05

    - +

    2019-06-05

    - -

    2019-06-07

    - +

    2019-06-07

    - -

    2019-06-10

    - +
    statistics-2018: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher 
    +
    +

    2019-06-10

    + +
  • Generate a new list of countries from the database for use with reconcile-csv:

    dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 228 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC) to /tmp/countries.csv WITH CSV HEADER
     COPY 192
     $ csvcut -l -c 0 /tmp/countries.csv > 2019-06-10-countries.csv
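  • That file can then be served to OpenRefine with reconcile-csv the same way as the affiliations; a sketch, assuming the columns end up named text_value and line_number after csvcut -l:

    $ lein run 2019-06-10-countries.csv text_value line_number
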
  • Get a list of all the unique AGROVOC subject terms in IITA’s data and export it to a text file so I can validate them with my agrovoc-lookup.py script:

    - +
    $ csvcut -c dc.subject ~/Downloads/2019-06-10-IITA-20194th-Round-2.csv| sed 's/||/\n/g' | grep -v dc.subject | sort -u > iita-agrovoc.txt
     $ ./agrovoc-lookup.py -i iita-agrovoc.txt -om iita-agrovoc-matches.txt -or iita-agrovoc-rejects.txt
     $ wc -l iita-agrovoc*
    -402 iita-agrovoc-matches.txt
    -29 iita-agrovoc-rejects.txt
    -431 iita-agrovoc.txt
    -
  • - -
  • Combine these IITA matches with the subjects I matched a few months ago:

    - -
    $ csvcut -c name 2019-03-18-subjects-matched.csv | grep -v name | cat - iita-agrovoc-matches.txt | sort -u > 2019-06-10-subjects-matched.txt
    -
  • - -
  • Then make a new list to use with reconcile-csv by adding line numbers with csvcut and changing the line number header to id:

    - -
    $ csvcut -c name -l 2019-06-10-subjects-matched.txt | sed 's/line_number/id/' > 2019-06-10-subjects-matched.csv
    -
  • + 402 iita-agrovoc-matches.txt + 29 iita-agrovoc-rejects.txt + 431 iita-agrovoc.txt + - -

    2019-06-20

    - +
    $ csvcut -c name 2019-03-18-subjects-matched.csv | grep -v name | cat - iita-agrovoc-matches.txt | sort -u > 2019-06-10-subjects-matched.txt
    +
    +
    $ csvcut -c name -l 2019-06-10-subjects-matched.txt | sed 's/line_number/id/' > 2019-06-10-subjects-matched.csv
    +

    2019-06-20

    - -

    2019-06-23

    - +

    2019-06-23

    + +
  • Update my local PostgreSQL container:
  • +
    $ podman pull docker.io/library/postgres:9.6-alpine
     $ podman rm dspacedb
     $ podman run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
    -
    - - -

    2019-06-25

    - +

    2019-06-25

    dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('ethnob', 'en', '*', 'E.', '');
     UPDATE 1551
     dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IS NULL;
     UPDATE 2070
     dspace=# UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('es', 'spa');
     UPDATE 2
    -
    - -
  • Upload 202 IITA records from earlier this month (20194th.xls) to CGSpace

  • - -
  • Communicate with Bioversity contractor in charge of their migration from Typo3 to CGSpace

  • + - -

    2019-06-28

    - +

    2019-06-28

    - -

    2019-06-30

    - + + +

    2019-06-30

    $ dspace import -a -e me@cgiar.org -m 2019-06-30-AfricaRice-11to73.map -s /tmp/2019-06-30-AfricaRice-11to73
    -
    - - -
  • I sent feedback about a few missing PDFs and one duplicate to Ibnou to check

  • - -
  • Run all system updates on DSpace Test (linode19) and reboot it

  • + - - + diff --git a/docs/2019-07/index.html b/docs/2019-07/index.html index 4a9fcb3c3..4450b6f75 100644 --- a/docs/2019-07/index.html +++ b/docs/2019-07/index.html @@ -8,14 +8,13 @@ @@ -27,17 +26,16 @@ Abenet had another similar issue a few days ago when trying to find the stats fo - + @@ -118,48 +116,40 @@ Abenet had another similar issue a few days ago when trying to find the stats fo

    -

    2019-07-01

    - +

    2019-07-01

    +
  • Abenet had another similar issue a few days ago when trying to find the stats for 2018 in the RTB community
  • - - -

    Atmire CUA 2018 stats missing

    - + + +

    Atmire CUA 2018 stats missing

    org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2010': Unable to create core [statistics-2010] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2010/data/index/write.lock
    -
    - - -
  • I restarted Tomcat ten times and it never worked…

  • - -
  • I tried to stop Tomcat and delete the write locks:

    - +
    # systemctl stop tomcat7
     # find /dspace/solr/statistics* -iname "*.lock" -print -delete
     /dspace/solr/statistics/data/index/write.lock
    @@ -174,163 +164,131 @@ Abenet had another similar issue a few days ago when trying to find the stats fo
     /dspace/solr/statistics-2018/data/index/write.lock
     # find /dspace/solr/statistics* -iname "*.lock" -print -delete
     # systemctl start tomcat7
    -
  • - -
  • But it still didn’t work!

  • - -
  • I stopped Tomcat, deleted the old locks, and will try to use the “simple” lock file type in solr/statistics/conf/solrconfig.xml:

    - -
    <lockType>${solr.lock.type:simple}</lockType>
    -
  • - -
  • And after restarting Tomcat it still doesn’t work

  • - -
  • Now I’ll try going back to “native” locking with unlockAtStartup:

    - -
    <unlockOnStartup>true</unlockOnStartup>
    -
  • - -
  • Now the cores seem to load, but I still see an error in the Solr Admin UI and I still can’t access any stats before 2018

  • - -
  • I filed an issue with Atmire, so let’s see if they can help

  • - -
  • And since I’m annoyed and it’s been a few months, I’m going to move the JVM heap settings that I’ve been testing on DSpace Test to CGSpace

  • - -
  • The old ones were:

    - -
    -Djava.awt.headless=true -Xms8192m -Xmx8192m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8 -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=5400 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false
    -
  • - -
  • And the new ones come from Solr 4.10.x’s startup scripts:

    - -
    -Djava.awt.headless=true
    --Xms8192m -Xmx8192m
    --Dfile.encoding=UTF-8
    --XX:NewRatio=3
    --XX:SurvivorRatio=4
    --XX:TargetSurvivorRatio=90
    --XX:MaxTenuringThreshold=8
    --XX:+UseConcMarkSweepGC
    --XX:+UseParNewGC
    --XX:ConcGCThreads=4 -XX:ParallelGCThreads=4
    --XX:+CMSScavengeBeforeRemark
    --XX:PretenureSizeThreshold=64m
    --XX:+UseCMSInitiatingOccupancyOnly
    --XX:CMSInitiatingOccupancyFraction=50
    --XX:CMSMaxAbortablePrecleanTime=6000
    --XX:+CMSParallelRemarkEnabled
    --XX:+ParallelRefProcEnabled
    --Dcom.sun.management.jmxremote
    --Dcom.sun.management.jmxremote.port=1337
    --Dcom.sun.management.jmxremote.ssl=false
    --Dcom.sun.management.jmxremote.authenticate=false
    -
  • + - -

    2019-07-02

    - +
    <lockType>${solr.lock.type:simple}</lockType>
    +
    +
    <unlockOnStartup>true</unlockOnStartup>
    +
    +
    -Djava.awt.headless=true -Xms8192m -Xmx8192m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8 -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=5400 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false
    +
    +
        -Djava.awt.headless=true
    +    -Xms8192m -Xmx8192m
    +    -Dfile.encoding=UTF-8
    +    -XX:NewRatio=3
    +    -XX:SurvivorRatio=4
    +    -XX:TargetSurvivorRatio=90
    +    -XX:MaxTenuringThreshold=8
    +    -XX:+UseConcMarkSweepGC
    +    -XX:+UseParNewGC
    +    -XX:ConcGCThreads=4 -XX:ParallelGCThreads=4
    +    -XX:+CMSScavengeBeforeRemark
    +    -XX:PretenureSizeThreshold=64m
    +    -XX:+UseCMSInitiatingOccupancyOnly
    +    -XX:CMSInitiatingOccupancyFraction=50
    +    -XX:CMSMaxAbortablePrecleanTime=6000
    +    -XX:+CMSParallelRemarkEnabled
    +    -XX:+ParallelRefProcEnabled
    +    -Dcom.sun.management.jmxremote
    +    -Dcom.sun.management.jmxremote.port=1337
    +    -Dcom.sun.management.jmxremote.ssl=false
    +    -Dcom.sun.management.jmxremote.authenticate=false
    +

    2019-07-02

    $ sed -i 's/CC-BY 4.0/CC-BY-4.0/' item_*/dublin_core.xml
     $ echo "10568/101992" >> item_*/collections
     $ dspace import -a -e me@cgiar.org -m 2019-07-02-Sharefair.map -s /tmp/Sharefair_mapped
    -
    - - -
  • I noticed that all twenty-seven items had double dates like “2019-05||2019-05” so I fixed those, but the rest of the metadata looked good so I unmapped them from the temporary collection

  • - -
  • Finish looking at the fifty-six AfricaRice items and upload them to CGSpace:

    - +
    $ dspace import -a -e me@cgiar.org -m 2019-07-02-AfricaRice-11to73.map -s /tmp/SimpleArchiveFormat
    -
  • - -
  • Peter pointed out that the Sharefair dates I fixed were not actually fixed
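  • One way to collapse a doubled value like “2019-05||2019-05” is a GREL transform in OpenRefine with a backreference (a sketch, assuming the dates are edited as single cells rather than split into multi-values):

    value.replace(/^(\d{4}(-\d{2})?)\|\|\1$/, "$1")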

    - + - -

    2019-07-03

    - +
  • + +

    2019-07-03

    - -

    2019-07-04

    - +

    2019-07-04

    $ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/new-bioversity-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2019-07-04-orcid-ids.txt
     $ ./resolve-orcids.py -i /tmp/2019-07-04-orcid-ids.txt -o 2019-07-04-orcid-names.txt -d
    -
    - - -
  • Send and merge a pull request for the new ORCID identifiers (#428)

  • - -
  • I created a CSV with some ORCID identifiers that I had seen change so I could update any existing ones in the database:

    - +
    cg.creator.id,correct
     "Marius Ekué: 0000-0002-5829-6321","Marius R.M. Ekué: 0000-0002-5829-6321"
     "Mwungu: 0000-0001-6181-8445","Chris Miyinzi Mwungu: 0000-0001-6181-8445"
     "Mwungu: 0000-0003-1658-287X","Chris Miyinzi Mwungu: 0000-0003-1658-287X"
    -
  • - -
  • But when I ran fix-metadata-values.py I didn’t see any changes:

    - -
    $ ./fix-metadata-values.py -i 2019-07-04-update-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -m 240 -t correct -d
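  • When a dry run reports nothing, it is worth checking whether the “incorrect” values are actually present in the database; a sketch against the cg.creator.id field (metadata_field_id 240), with a hypothetical LIKE pattern:

    $ psql -h localhost -U dspace dspace -c "SELECT text_value FROM metadatavalue WHERE metadata_field_id=240 AND text_value LIKE 'Mwungu:%';"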
    -
  • + - -

    2019-07-06

    - +
    $ ./fix-metadata-values.py -i 2019-07-04-update-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -m 240 -t correct -d
    +

    2019-07-06

    - -

    2019-07-08

    - +

    2019-07-08

    +
  • Meeting with AgroKnow and CTA about their new ICT Update story telling thing -
  • - -
  • Playing with the idea of using xsv to do some basic batch quality checks on CSVs, for example to find items that might be duplicates if they have the same DOI or title:

    - + +
  • +
  • Playing with the idea of using xsv to do some basic batch quality checks on CSVs, for example to find items that might be duplicates if they have the same DOI or title:
  • +
    $ xsv frequency --select cg.identifier.doi --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E ',1'
     field,value,count
     cg.identifier.doi,https://doi.org/10.1016/j.agwat.2018.06.018,2
     $ xsv frequency --select dc.title --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E ',1'         
     field,value,count
     dc.title,Reference evapotranspiration prediction using hybridized fuzzy model with firefly algorithm: Regional case study in Burkina Faso,2
    -
    - -
  • Or perhaps if DOIs are valid or not (having doi.org in the URL):

    - +
    $ xsv frequency --select cg.identifier.doi --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E 'doi.org'
     field,value,count
     cg.identifier.doi,https://hdl.handle.net/10520/EJC-1236ac700f,1
    -
  • - -
  • Or perhaps items with invalid ISSNs (according to the ISSN code format):

    - +
    $ xsv select dc.identifier.issn cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v '"' | grep -v -E '^[0-9]{4}-[0-9]{3}[0-9xX]$'
     dc.identifier.issn
     978-3-319-71997-9
    @@ -338,86 +296,69 @@ dc.identifier.issn
     978-3-319-71997-9
     978-3-319-58789-9
     2320-7035 
    -2593-9173
    -
  • - - -

    2019-07-09

    - + 2593-9173 +

    2019-07-09

    - -

    2019-07-11

    - + + +

    2019-07-11

    +
  • Skype call with Jane Poole to discuss OpenRXV/AReS Phase II TORs -
  • -
  • Yesterday Thierry from CTA asked me about an error he was getting while submitting an item on CGSpace: “Unable to load Submission Information, since WorkspaceID (ID:S106658) is not a valid in-process submission.”
  • - -
  • I looked in the DSpace logs and found this right around the time of the screenshot he sent me:

    - -
    2019-07-10 11:50:27,433 INFO  org.dspace.submit.step.CompleteStep @ lewyllie@cta.int:session_id=A920730003BCAECE8A3B31DCDE11A97E:submission_complete:Completed submission with id=106658
    -
  • - -
  • I’m assuming something happened in his browser (like a refresh) after the item was submitted…

  • - -

    2019-07-12

    - + +
  • Yesterday Thierry from CTA asked me about an error he was getting while submitting an item on CGSpace: “Unable to load Submission Information, since WorkspaceID (ID:S106658) is not a valid in-process submission.”
  • +
  • I looked in the DSpace logs and found this right around the time of the screenshot he sent me:
  • + +
    2019-07-10 11:50:27,433 INFO  org.dspace.submit.step.CompleteStep @ lewyllie@cta.int:session_id=A920730003BCAECE8A3B31DCDE11A97E:submission_complete:Completed submission with id=106658
    +
    +

    2019-07-12

    +
  • Run all system updates on DSpace Test (linode19) and reboot it
  • - -
  • Try to run dspace cleanup -v on CGSpace and ran into an error:

    - +
  • Try to run dspace cleanup -v on CGSpace and ran into an error:
  • +
    Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
    -Detail: Key (bitstream_id)=(167394) is still referenced from table "bundle".
    -
    - -
  • The solution is, as always:

    - + Detail: Key (bitstream_id)=(167394) is still referenced from table "bundle". +
    # su - postgres
     $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (167394);'
     UPDATE 1
    -
  • - - -

    2019-07-16

    - +

    2019-07-16

    $ podman system prune -a -f --volumes
     $ sudo rm -rf ~/.local/share/containers
    -
    - -
  • Then pull a new PostgreSQL 9.6 image and load a CGSpace database dump into a new local test container:

    - +
    $ podman pull postgres:9.6-alpine
     $ podman run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
     $ createuser -h localhost -U postgres --pwprompt dspacetest
    @@ -426,108 +367,91 @@ $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
     $ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost ~/Downloads/cgspace_2019-07-16.backup
     $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'                     
     $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
    -
  • - -
  • Start working on implementing the CG Core v2 changes on my local DSpace test environment

  • - -
  • Make a pull request to CG Core v2 with some fixes for typos in the specification (#5)

  • + - -

    2019-07-18

    - +

    2019-07-18

    + +
  • Sisay said a user was having problems registering on CGSpace and it looks like the email account expired again:
  • +
    $ dspace test-email
     
     About to send test email:
    -- To: blahh@cgiar.org
    -- Subject: DSpace test email
    -- Server: smtp.office365.com
    + - To: blahh@cgiar.org
    + - Subject: DSpace test email
    + - Server: smtp.office365.com
     
     Error sending email:
    -- Error: javax.mail.AuthenticationFailedException
    + - Error: javax.mail.AuthenticationFailedException
     
     Please see the DSpace documentation for assistance.
    -
    - -
  • I emailed ICT to ask them to reset it and make the expiration period longer if possible

  • + - -

    2019-07-19

    - +

    2019-07-19

    - -

    2019-07-20

    - + + +

    2019-07-20

    $ dspace user --add --givenname Lionelle --surname Samnick --email blah@blah.com --password 'blah'
    -
    - -
  • I added her as a submitter to CTA ISF Pro-Agro series

  • - -
  • Start looking at 1429 records for the Bioversity batch import

    - + - -

    2019-07-22

    - - - -

    2019-07-25

    - +

    2019-07-22

    +
        <dct:coverage>
    +        <dct:spatial>
    +            <type>Country</type>
    +            <dct:identifier>http://sws.geonames.org/192950</dct:identifier>
    +            <rdfs:label>Kenya</rdfs:label>
    +        </dct:spatial>
    +    </dct:coverage>
    +
    +

    2019-07-25

    + +
  • +
  • +

    I might be able to use isbnlib to validate ISBNs in Python:

    +
  • +
    if isbnlib.is_isbn10('9966-955-07-0') or isbnlib.is_isbn13('9966-955-07-0'):
    -print("Yes")
    +    print("Yes")
     else:
    -print("No")
    -
    - -
  • Or with python-stdnum:

    - + print("No") +
    from stdnum import isbn
     from stdnum import issn
     
     isbn.validate('978-92-9043-389-7')
     issn.validate('1020-3362')
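  • A sketch of applying the same check to a whole CSV column with csvcut (the file name and column here are hypothetical):

    $ csvcut -c dc.identifier.issn items.csv | sed 1d | python3 -c "import sys; from stdnum import issn; [print(v) for v in (l.strip() for l in sys.stdin) if v and not issn.is_valid(v)]"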
    -
  • - - -

    2019-07-26

    - +

    2019-07-26

    - -

    2019-07-29

    - + +
  • +

    I figured out a GREL to trim spaces in multi-value cells without splitting them:

    +
  • + +
    value.replace(/\s+\|\|/,"||").replace(/\|\|\s+/,"||")
    +
    +

    2019-07-29

    +
  • Inform Bioversity that there is an error in their CSV, seemingly caused by quotes in the citation field
  • - -

    2019-07-30

    - +

    2019-07-30

    - - + diff --git a/docs/2019-08/index.html b/docs/2019-08/index.html index 3756eb186..255373483 100644 --- a/docs/2019-08/index.html +++ b/docs/2019-08/index.html @@ -8,19 +8,16 @@ @@ -33,23 +30,20 @@ Run system updates on DSpace Test (linode19) and reboot it - + @@ -130,473 +124,416 @@ Run system updates on DSpace Test (linode19) and reboot it

    -

    2019-08-03

    - +

    2019-08-03

    - -

    2019-08-04

    - +

    2019-08-04

    +
  • Run system updates on DSpace Test (linode19) and reboot it
  • - -

    2019-08-05

    - +

    2019-08-05

    + + +
    or(
    +  isNotNull(value.match(/^.*’.*$/)),
    +  isNotNull(value.match(/^.*é.*$/)),
    +  isNotNull(value.match(/^.*á.*$/)),
    +  isNotNull(value.match(/^.*è.*$/)),
    +  isNotNull(value.match(/^.*í.*$/)),
    +  isNotNull(value.match(/^.*ó.*$/)),
    +  isNotNull(value.match(/^.*ú.*$/)),
    +  isNotNull(value.match(/^.*à.*$/)),
    +  isNotNull(value.match(/^.*û.*$/))
    +).toString()
    +
    +

    2019-08-06

    - -

    2019-08-07

    - + + +

    2019-08-07

    - -

    2019-08-08

    - +

    2019-08-08

    # /opt/certbot-auto renew --standalone --pre-hook "/usr/bin/docker stop angular_nginx; /bin/systemctl stop firewalld" --post-hook "/bin/systemctl start firewalld; /usr/bin/docker start angular_nginx"
    -
    - - -
  • It is important that the firewall starts back up before the Docker container or else Docker will complain about missing iptables chains

  • - -
  • Also, I updated to the latest TLS Intermediate settings as appropriate for Ubuntu 18.04’s OpenSSL 1.1.0g with nginx 1.16.0

  • - -
  • Run all system updates on AReS dev server (linode20) and reboot it

  • - -
  • Get a list of all PDFs from the Bioversity migration that fail to download and save them so I can try again with a different path in the URL:

    - +
    $ ./generate-thumbnails.py -i /tmp/2019-08-05-Bioversity-Migration.csv -w --url-field-name url -d | tee /tmp/2019-08-08-download-pdfs.txt
     $ grep -B1 "Download failed" /tmp/2019-08-08-download-pdfs.txt | grep "Downloading" | sed -e 's/> Downloading //' -e 's/\.\.\.//' | sed -r 's/\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[mGK]//g' | csvcut -H -c 1,1 > /tmp/user-upload.csv
     $ ./generate-thumbnails.py -i /tmp/user-upload.csv -w --url-field-name url -d | tee /tmp/2019-08-08-download-pdfs2.txt
     $ grep -B1 "Download failed" /tmp/2019-08-08-download-pdfs2.txt | grep "Downloading" | sed -e 's/> Downloading //' -e 's/\.\.\.//' | sed -r 's/\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[mGK]//g' | csvcut -H -c 1,1 > /tmp/user-upload2.csv
     $ ./generate-thumbnails.py -i /tmp/user-upload2.csv -w --url-field-name url -d | tee /tmp/2019-08-08-download-pdfs3.txt
    -
  • - -
  • (the weird sed regex removes color codes, because my generate-thumbnails script prints pretty colors)

  • - -
  • Some PDFs are uploaded in different paths so I have to try a few times to get them all:

    - + +
  • +
  • +

    Even so, there are still 52 items with incorrect filenames, so I can't derive their PDF URLs…

  • - -
  • I will proceed with a metadata-only upload first and then let them know about the missing PDFs

  • - -
  • Troubleshoot an issue we had with proxying to the new development version of AReS from DSpace Test (linode19)

    - + +
  • +
  • +

    I will proceed with a metadata-only upload first and then let them know about the missing PDFs

    +
  • +
  • +

    Troubleshoot an issue we had with proxying to the new development version of AReS from DSpace Test (linode19)

  • - -
  • Though I am really wondering why this happened now, because the configuration has been working for months…

  • - -
  • Improve the output of the suspicious characters check in csv-metadata-quality script and tag version 0.2.0

  • +
  • The solution is to set the host header when proxy passing:
  • - -

    2019-08-09

    - + + +
    proxy_set_header Host dev.ares.codeobia.com;
    +
    +

    2019-08-09

    - -

    2019-08-10

    - + + +

    2019-08-10

    - -

    2019-08-12

    - +

    2019-08-12

    + + +

    2019-08-13

    $ dspace user -a -m blah@blah.com -g Mohammad -s Salem -p 'domoamaaa'
    -
    - -
  • Create and merge a pull request (#429) to add eleven new CCAFS Phase II Project Tags to CGSpace

  • - -
  • Atmire responded to the Solr cores issue last week, but they could not reproduce the issue

    - + +
  • +
  • Testing an import of 1,429 Bioversity items (metadata only) on my local development machine and got an error with Java memory after about 1,000 items:
  • +
    $ ~/dspace/bin/dspace metadata-import -f /tmp/bioversity.csv -e blah@blah.com
     ...
     java.lang.OutOfMemoryError: GC overhead limit exceeded
    -
    - -
  • I increased the heap size to 1536m and tried again:

    - +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1536m"
     $ ~/dspace/bin/dspace metadata-import -f /tmp/bioversity.csv -e blah@blah.com
    -
  • - -
  • This time it succeeded, and using VisualVM I noticed that the import process used a maximum of 620MB of RAM

  • - -
  • (oops, I realize that actually I forgot to delete items I had flagged as duplicates, so the total should be 1,427 items)

  • + - -

    2019-08-14

    - +

    2019-08-14

    $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx512m'
     $ dspace metadata-import -f /tmp/bioversity1.csv -e blah@blah.com
     $ dspace metadata-import -f /tmp/bioversity2.csv -e blah@blah.com
    -
    - - -
  • The next step is to check these items for duplicates

  • + - -

    2019-08-16

    - +

    2019-08-16

    - -

    2019-08-18

    - +

    2019-08-18

    - -

    2019-08-20

    - +
    statistics-2015: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
    +
    +

    2019-08-20

    import os
     
     return os.path.basename(value)
    -
    - - -
  • Then I can try to download all the files again with the script

  • - -
  • I also asked Francesco about the strange filenames (.LCK, .zip, and .7z)

  • + - -

    2019-08-21

    - +

    2019-08-21

    - -

    2019-08-22

    - + + +

    2019-08-22

    - -

    2019-08-23

    - +

    2019-08-23

    - -

    2019-08-26

    - +

    2019-08-26

    $ ./fix-metadata-values.py -i ~/Downloads/2019-08-26-Peter-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct
    -
    - - -
  • Apply the corrections on CGSpace and DSpace Test

    - +
    $ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
     real    81m47.057s 
     user    8m5.265s 
     sys     2m24.715s
    -
  • - - -
  • Peter asked me to add related citation aka cg.link.citation to the item view

    - + - -

    2019-08-27

    - +
  • +
  • +

    Add the ability to skip certain fields from the csv-metadata-quality script using --exclude-fields

    + +
  • + +

    2019-08-27

    +
  • Add a fix for missing space after commas to my csv-metadata-quality script and tag version 0.2.2
  • - -

    2019-08-28

    - +

    2019-08-28

    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-08-28-all-authors.csv with csv header;
     COPY 65597
    -
    - - -
  • Then I created a new CSV with two author columns (edit title of second column after):

    - -
    $ csvcut -c dc.contributor.author,dc.contributor.author /tmp/2019-08-28-all-authors.csv > /tmp/all-authors.csv
    -
  • - -
  • Then I ran my script on the new CSV, skipping one of the author columns:

    - -
    $ csv-metadata-quality -u -i /tmp/all-authors.csv -o /tmp/authors.csv -x dc.contributor.author
    -
  • - -
  • This fixed a bunch of issues with spaces, commas, unnecessary Unicode characters, etc

  • - -
  • Then I ran the corrections on my test server and there were 185 of them!

    - -
    $ ./fix-metadata-values.py -i /tmp/authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correctauthor
    -
  • - -
  • I very well might run these on CGSpace soon…

  • + - -

    2019-08-29

    - +
    $ csvcut -c dc.contributor.author,dc.contributor.author /tmp/2019-08-28-all-authors.csv > /tmp/all-authors.csv
    +
    +
    $ csv-metadata-quality -u -i /tmp/all-authors.csv -o /tmp/authors.csv -x dc.contributor.author
    +
    +
    $ ./fix-metadata-values.py -i /tmp/authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correctauthor
    +
    +

    2019-08-29

    $ find dspace/modules/xmlui-mirage2/src/main/webapp/themes -iname "*.xsl" -exec ./cgcore-xsl-replacements.sed {} \;
    -
    - - -
  • I think I got everything in the XMLUI themes, but there may be some things I should check once I get a deployment up and running:

    - + +
  • +
  • Thierry Lewadle asked why some PDFs on CGSpace open in the browser and some download
  • - -
  • Peter asked why an item on CGSpace has no Altmetric donut on the item view, but has one in our explorer

    - -
  • - -
  • So this is the same issue we had before, where Altmetric knows this Handle is associated with a DOI that has a score, but the client-side JavaScript code doesn’t show it because it seems to be a secondary handle or something

  • - -

    2019-08-31

    - + +
  • Peter asked why an item on CGSpace has no Altmetric donut on the item view, but has one in our explorer + +
  • + +
    "handles":["10986/30568","10568/97825"],"handle":"10986/30568"
    +
    +

    2019-08-31

    $ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
      
     real    90m47.967s
     user    8m12.826s
     sys     2m27.496s
    -
    - -
  • I set up a test environment for CG Core v2 on my local environment and ran all the field migrations

    - + - - +
  • + + diff --git a/docs/2019-09/index.html b/docs/2019-09/index.html index aae6a7a0b..df009550f 100644 --- a/docs/2019-09/index.html +++ b/docs/2019-09/index.html @@ -8,34 +8,31 @@ @@ -46,36 +43,33 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning: - + @@ -156,158 +150,136 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:

    -

    2019-09-01

    - +

    2019-09-01

    # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -440 17.58.101.255
    -441 157.55.39.101
    -485 207.46.13.43
    -728 169.60.128.125
    -730 207.46.13.108
    -758 157.55.39.9
    -808 66.160.140.179
    -814 207.46.13.212
    -2472 163.172.71.23
    -6092 3.94.211.189
    +    440 17.58.101.255
    +    441 157.55.39.101
    +    485 207.46.13.43
    +    728 169.60.128.125
    +    730 207.46.13.108
    +    758 157.55.39.9
    +    808 66.160.140.179
    +    814 207.46.13.212
    +   2472 163.172.71.23
    +   6092 3.94.211.189
     # zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    - 33 2a01:7e00::f03c:91ff:fe16:fcb
    - 57 3.83.192.124
    - 57 3.87.77.25
    - 57 54.82.1.8
    -822 2a01:9cc0:47:1:1a:4:0:2
    -1223 45.5.184.72
    -1633 172.104.229.92
    -5112 205.186.128.185
    -7249 2a01:7e00::f03c:91ff:fe18:7396
    -9124 45.5.186.2
    -
    - - - +

    2019-09-10

    +
  • Follow up with Bosede about the mixup with PDFs in the items uploaded in 2018-12 (aka Daniel1807.xsl) -
  • + +
  • Continue working on CG Core v2 migration, focusing on the crosswalk mappings -
  • - -

    2019-09-11

    - + + +

    2019-09-11

    +
  • More work on the CG Core v2 migrations -
  • - -

    2019-09-12

    - + + +

    2019-09-12

    - -

    2019-09-15

    - +

    2019-09-15

    +
  • Update nginx TLS cipher suite to the latest Mozilla intermediate recommendations for nginx 1.16.0 and openssl 1.0.2 -
  • - -
  • XMLUI item view pages are blank on CGSpace right now

    - + +
  • +
  • XMLUI item view pages are blank on CGSpace right now +
  • +
    2019-09-15 15:32:18,137 WARN  org.apache.cocoon.components.xslt.TraxErrorListener  - Can not load requested doc: unknown protocol: cocoon at jndi:/localhost/themes/CIAT/xsl/../../0_CGIAR/xsl//aspect/artifactbrowser/common.xsl:141:90
    -
    - - -
  • Around the same time I see the following in the DSpace log:

    - +
    2019-09-15 15:32:18,079 INFO  org.dspace.usage.LoggerUsageEventListener @ aorth@blah:session_id=A11C362A7127004C24E77198AF9E4418:ip_addr=x.x.x.x:view_item:handle=10568/103644 
     2019-09-15 15:32:18,135 WARN  org.dspace.core.PluginManager @ Cannot find named plugin for interface=org.dspace.content.crosswalk.DisseminationCrosswalk, name="METSRIGHTS"
    -
  • - -
  • I see a lot of these errors today, but not earlier this month:

    - +
    # grep -c 'Cannot find named plugin' dspace.log.2019-09-*
     dspace.log.2019-09-01:0
     dspace.log.2019-09-02:0
    @@ -324,27 +296,23 @@ dspace.log.2019-09-12:0
     dspace.log.2019-09-13:0
     dspace.log.2019-09-14:0
     dspace.log.2019-09-15:808
    -
  • - -
  • Something must have happened when I restarted Tomcat a few hours ago, because earlier in the DSpace log I see a bunch of errors like this:

    - +
    2019-09-15 13:59:24,136 ERROR org.dspace.core.PluginManager @ Name collision in named plugin, implementation class="org.dspace.content.crosswalk.METSRightsCrosswalk", name="METSRIGHTS"
     2019-09-15 13:59:24,136 ERROR org.dspace.core.PluginManager @ Name collision in named plugin, implementation class="org.dspace.content.crosswalk.OREDisseminationCrosswalk", name="ore"
     2019-09-15 13:59:24,136 ERROR org.dspace.core.PluginManager @ Name collision in named plugin, implementation class="org.dspace.content.crosswalk.DIMDisseminationCrosswalk", name="dim"
    -
  • - -
  • I restarted Tomcat and the item views came back, but then the Solr statistics cores didn’t all load properly
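  • A quick way to see which statistics cores actually loaded is Solr's CoreAdmin API (a sketch, assuming Solr is on port 8081 as above):

    $ curl -s 'http://localhost:8081/solr/admin/cores?action=STATUS' | xmllint --format - | grep -E '<lst name="statistics'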

    - + - -

    2019-09-19

    - +
  • + +

    2019-09-19

    # docker pull docker.io/library/postgres:9.6-alpine
     # docker create volume dspacedb_data
     # docker run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
    @@ -354,15 +322,14 @@ $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
     $ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost ~/Downloads/cgspace_2019-08-31.backup
     $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
     $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
    -
    - -
  • Elizabeth from CIAT sent me a list of sixteen authors who need to have their ORCID identifiers tagged with their publications

    - +
    dc.contributor.author,cg.creator.id
     "Kihara, Job","Job Kihara: 0000-0002-4394-9553"
     "Twyman, Jennifer","Jennifer Twyman: 0000-0002-8581-5668"
    @@ -380,242 +347,212 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
     "Tamene, Lulseged","Lulseged Tamene: 0000-0002-3806-8890"
     "Andrieu, Nadine","Nadine Andrieu: 0000-0001-9558-9302"
     "Ramírez-Villegas, Julián","Julian Ramirez-Villegas: 0000-0002-8044-583X"
    -
  • - - -
  • I tested the file on my local development machine with the following invocation:

    - +
    $ ./add-orcid-identifiers-csv.py -i 2019-09-19-ciat-orcids.csv -db dspace -u dspace -p 'fuuu'
    -
  • - -
  • In my test environment this added 390 ORCID identifiers

  • - -
  • I ran the same updates on CGSpace and DSpace Test and then started a Discovery re-index to force the search index to update

  • - -
  • Update the PostgreSQL JDBC driver to version 42.2.8 in our Ansible infrastructure scripts

    - + +
  • +
  • Run system updates on DSpace Test (linode19) and reboot it
  • +
  • Start looking at IITA's latest round of batch updates that Sisay had uploaded to DSpace Test earlier this month - -

    2019-09-20

    - +
  • +
  • I also looked through the IITA subjects to normalize some values
  • + + +
  • Follow up with Marissa again about the CCAFS phase II project tags
  • +
  • Generate a list of the top 1500 authors on CGSpace:
  • + +
    dspace=# \copy (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = (SELECT metadata_field_id FROM metadatafieldregistry WHERE element = 'contributor' AND qualifier = 'author') AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-09-19-top-1500-authors.csv WITH CSV HEADER;
    +
    +
    $ csvcut -c text_value /tmp/2019-09-19-top-1500-authors.csv | grep -v text_value | sed 's/"//g' | sort > dspace/config/controlled-vocabularies/dc-contributor-author.xml
    +
    +
    $ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/dc-contributor-author.xml
    +
    +

    2019-09-20

    +
    $ perl-rename -n 's/_{2,3}/_/g' *.pdf
    -
    - - -
  • I was preparing to run SAFBuilder for the Bioversity migration and decided to check the list of PDFs on my local machine against those on DSpace Test (where I had downloaded them last month); one way to compare the two lists is sketched below
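  • A sketch of that comparison using comm on two sorted file listings (the host and paths are hypothetical):

    $ ssh dspacetest.cgiar.org 'ls ~/Bioversity' | sort > /tmp/dspacetest-pdfs.txt
    $ ls ~/Bioversity | sort > /tmp/local-pdfs.txt
    $ comm -3 /tmp/local-pdfs.txt /tmp/dspacetest-pdfs.txt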

    - +
    $ rename -v 's/___/_/g'  *.pdf
     $ rename -v 's/__/_/g'  *.pdf
    -
  • - - -
  • I’m still waiting to hear what Carol and Francesca want to do with the 1195.pdf.LCK file (for now I’ve removed it from the CSV, but for future reference it has the number 630 in its permalink)

  • - -
  • I wrote two fairly long GREL expressions to clean up the institutional author names in the dc.contributor.author and dc.identifier.citation fields using OpenRefine

    - +
    value.replace(/,? ?\((ANDES|APAFRI|APFORGEN|Canada|CFC|CGRFA|China|CacaoNet|CATAS|CDU|CIAT|CIRF|CIP|CIRNMA|COSUDE|Colombia|COA|COGENT|CTDT|Denmark|DfLP|DSE|ECPGR|ECOWAS|ECP\/GR|England|EUFORGEN|FAO|France|Francia|FFTC|Germany|GEF|GFU|GGCO|GRPI|italy|Italy|Italia|India|ICCO|ICAR|ICGR|ICRISAT|IDRC|INFOODS|IPGRI|IBPGR|ICARDA|ILRI|INIBAP|INBAR|IPK|ISG|IT|Japan|JIRCAS|Kenya|LI\-BIRD|Malaysia|NARC|NBPGR|Nepal|OOAS|RDA|RISBAP|Rome|ROPPA|SEARICE|Senegal|SGRP|Sweden|Syrian Arab Republic|The Netherlands|UNDP|UK|UNEP|UoB|UoM|United Kingdom|WAHO)\)/,"")
    -
  • - -
  • The second targets cities and countries after names like “International Livestock Research Institute, Kenya”:

    - +
    replace(/,? ?(ali|Aleppo|Amsterdam|Beijing|Bonn|Burkina Faso|CN|Dakar|Gatersleben|London|Montpellier|Nairobi|New Delhi|Kaski|Kepong|Malaysia|Khumaltar|Lima|Ltpur|Ottawa|Patancheru|Peru|Pokhara|Rome|Uppsala|University of Mauritius|Tsukuba)/,"")
    -
  • - - -
  • I imported the 1,427 Bioversity records with bitstreams to a new collection called 2019-09-20 Bioversity Migration Test on DSpace Test (after splitting them in two batches of about 700 each):

    - +
    $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx768m'
     $ dspace import -a me@cgiar.org -m 2019-09-20-bioversity1.map -s /home/aorth/Bioversity/bioversity1
     $ dspace import -a me@cgiar.org -m 2019-09-20-bioversity2.map -s /home/aorth/Bioversity/bioversity2
    -
  • - -
  • After that I exported the collection again and started doing some quality checks and cleanups:

    - + +
  • +
  • The next steps are:
  • - -

    2019-09-21

    - + + +

    2019-09-21

    +
  • Play with language identification using the langdetect, fasttext, polyglot, and langid libraries -
  • -
  • I added very experimental language detection to the csv-metadata-quality module - -
  • - -

    2019-09-24

    - + +
  • I added very experimental language detection to the csv-metadata-quality module + +
  • + +

    2019-09-24

    - -

    2019-09-26

    - + + +

    2019-09-26

    +
  • Give more feedback to Bosede about the IITA Sept 6 (20196th.xls) records on DSpace Test -
  • - -
  • Get a list of institutions from CCAFS’s Clarisa API and try to parse it with jq, do some small cleanups and add a header in sed, and then pass it through csvcut to add line numbers:

    - + +
  • +
  • Get a list of institutions from CCAFS's Clarisa API and try to parse it with jq, do some small cleanups and add a header in sed, and then pass it through csvcut to add line numbers:
  • +
    $ cat ~/Downloads/institutions.json| jq '.[] | {name: .name}' | grep name | awk -F: '{print $2}' | sed -e 's/"//g' -e 's/^ //' -e '1iname' | csvcut -l | sed '1s/line_number/id/' > /tmp/clarisa-institutions.csv
     $ csv-metadata-quality -i /tmp/clarisa-institutions.csv -o /tmp/clarisa-institutions-cleaned.csv -u
    -
    - -
  • The csv-metadata-quality tool caught a few records with excessive spacing and unnecessary Unicode

  • - -
  • I could potentially use this with reconcile-csv and OpenRefine as a source to validate our institutional authors against…
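  • For reference, reconcile-csv is typically started against such a CSV like this (a sketch; the jar version is an assumption), after which it serves a reconciliation endpoint on port 8000 that OpenRefine can be pointed at:

    $ java -Xmx2g -jar reconcile-csv-0.1.2.jar /tmp/clarisa-institutions-cleaned.csv name id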

  • + - -

    2019-09-27

    - +

    2019-09-27

    +
  • The other alternative is to just keep using the names we have, which are mostly compliant with AGROVOC
  • + +
  • Peter said that a new server for DSpace Test is fine, so I can proceed with the normal process of getting approval from Michael Victor and ICT when I have time (recommend moving from $40 to $80/month Linode, with 16GB RAM)
  • I need to ask Atmire for a quote to upgrade CGSpace to DSpace 6 with all current modules so we can see how many more credits we need
  • - + +
  • A little bit more work on the Sept 6 IITA batch records -
  • - - + + + diff --git a/docs/2019-10/index.html b/docs/2019-10/index.html index bb77f66f5..c0a512321 100644 --- a/docs/2019-10/index.html +++ b/docs/2019-10/index.html @@ -6,8 +6,7 @@ - + @@ -15,9 +14,8 @@ - - + + @@ -98,159 +96,125 @@

    - - -

    2019-10-01

    - +

    2019-10-01

    - -

    2019-10-03

    - + + +
    $ csvcut -c 'id,dc.title[en_US],cg.coverage.region[en_US],cg.coverage.subregion[en_US],cg.river.basin[en_US]' ~/Downloads/10568-16814.csv > /tmp/iwmi-title-region-subregion-river.csv
    +
    +
    $ csv-metadata-quality -i ~/Downloads/10568-16814.csv -o /tmp/iwmi.csv -x 'dc.date.issued,dc.date.issued[],dc.date.issued[en_US]' -u
    +
    +

    2019-10-03

    - -

    2019-10-04

    - +

    2019-10-04

    $ dspace user -a -m blah@mail.it -g Francesco -s Vernocchi -p 'fffff'
    -
    - -
  • Email Francesca and Carol to ask for follow up about the test upload I did on 2019-09-21

    - + - -

    2019-10-06

    - +
  • + +

    2019-10-06

    +
  • Gabriela from CIP asked me if it was possible to generate an RSS feed of items that have the CIP subject “POTATO AGRI-FOOD SYSTEMS” -
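  • DSpace's OpenSearch endpoint might work for this; a hedged sketch (the exact path and the indexed name of the CIP subject field are assumptions):

    $ curl -s 'https://cgspace.cgiar.org/open-search/discover?query=cg.subject.cip:%22POTATO+AGRI-FOOD+SYSTEMS%22&format=rss'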
  • - -

    2019-10-08

    - + + +

    2019-10-08

    - -

    2019-10-09

    - + +
  • Start looking at duplicates in the Bioversity migration data on DSpace Test + +
  • + +

    2019-10-09

    +
  • Run all system updates on DSpace Test (linode19) and reboot the server
  • - -

    2019-10-10

    - +

    2019-10-10

    - -

    2019-10-11

    - + + +
    $ dspace user -a -m wow@me.com -g Felix -s Shaw -p 'fuananaaa'
    +

    2019-10-11

    $ dspace cleanup -v
     ...
     Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
    -Detail: Key (bitstream_id)=(171221) is still referenced from table "bundle".
    -
    - -
  • The solution, as always, is (repeat as many times as needed):

    - + Detail: Key (bitstream_id)=(171221) is still referenced from table "bundle". +
    # su - postgres
     $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (171221);'
     UPDATE 1
    -
  • - - -

    2019-10-12

    - +

    2019-10-12

    + +
  • I was preparing to check the affiliations on the Bioversity records when I noticed that the last list of top affiliations I generated has some anomalies +
  • +
    from,to
     CIAT,International Center for Tropical Agriculture
     International Centre for Tropical Agriculture,International Center for Tropical Agriculture
    @@ -259,170 +223,139 @@ International Centre for Agricultural Research in the Dry Areas,International Ce
     International Maize and Wheat Improvement Centre,International Maize and Wheat Improvement Center
     "Agricultural Information Resource Centre, Kenya.","Agricultural Information Resource Centre, Kenya"
     "Centre for Livestock and Agricultural Development, Cambodia","Centre for Livestock and Agriculture Development, Cambodia"
    -
    - - -
  • Then I applied it with my fix-metadata-values.py script on CGSpace:

    - +
    $ ./fix-metadata-values.py -i /tmp/affiliations.csv -db dspace -u dspace -p 'fuuu' -f from -m 211 -t to
    -
  • - -
  • I did some manual curation of about 300 authors in OpenRefine in preparation for telling Peter and Abenet that the migration is almost ready

    - + - -

    2019-10-13

    - +
  • + +

    2019-10-13

    + +
  • Peter is still seeing some authors listed with “|” in the “Top Authors” statistics for some collections +
  • +
    $ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
     real    82m35.993s
    -
    - - -
  • After the re-indexing the top authors still list the following:

    - -
    Jagwe, J.|Ouma, E.A.|Brandes-van Dorresteijn, D.|Kawuma, Brian|Smith, J.
    -
  • - -
  • I looked in the database to find authors that had “|” in them:

    - -
    dspace=# SELECT text_value, resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value LIKE '%|%';
    -        text_value            | resource_id 
    -----------------------------------+-------------
    -Anandajayasekeram, P.|Puskur, R. |         157
    -Morales, J.|Renner, I.           |       22779
    -Zahid, A.|Haque, M.A.            |       25492
    -(3 rows)
    -
  • - -
  • Then I found their handles and corrected them, for example:

    - -
    dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '157' and handle.resource_type_id=2;
    -handle   
    ------------
    -10568/129
    -(1 row)
    -
  • - -
  • So I’m still not sure where these weird authors in the “Top Author” stats are coming from

  • + - -

    2019-10-14

    - +
    Jagwe, J.|Ouma, E.A.|Brandes-van Dorresteijn, D.|Kawuma, Brian|Smith, J.
    +
    +
    dspace=# SELECT text_value, resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value LIKE '%|%';
    +            text_value            | resource_id 
    +----------------------------------+-------------
    + Anandajayasekeram, P.|Puskur, R. |         157
    + Morales, J.|Renner, I.           |       22779
    + Zahid, A.|Haque, M.A.            |       25492
    +(3 rows)
    +
    +
    dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '157' and handle.resource_type_id=2;
    +  handle   
    +-----------
    + 10568/129
    +(1 row)
    +
    +

    2019-10-14

    - -

    2019-10-15

    - + + +

    2019-10-15

    $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx512m'
     $ mkdir 2019-10-15-Bioversity
     $ dspace export -i 10568/108684 -t COLLECTION -m -n 0 -d 2019-10-15-Bioversity
     $ sed -i '/<dcvalue element="identifier" qualifier="uri">/d' 2019-10-15-Bioversity/*/dublin_core.xml
    -
    - - -
  • It’s really stupid, but for some reason the handles are included even though I specified the -m option, so after the export I removed the dc.identifier.uri metadata values from the items

  • - -
  • Then I imported a test subset of them in my local test environment:

    - +
    $ ~/dspace/bin/dspace import -a -c 10568/104049 -e fuu@cgiar.org -m 2019-10-15-Bioversity.map -s /tmp/2019-10-15-Bioversity
    -
  • - -
  • I had forgotten (again) that the dspace export command doesn’t preserve collection ownership or mappings, so I will have to create a temporary collection on CGSpace to import these to, then do the mappings again after import…
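  • One way to redo the mappings afterwards is the metadata CSV round trip, since the collection column in a metadata export holds the owning collection followed by any mapped collections (a sketch; the handle and file name are hypothetical):

    $ dspace metadata-export -i 10568/104057 -f /tmp/bioversity-imported.csv
    # edit the "collection" column to add the real target collections, then:
    $ dspace metadata-import -f /tmp/bioversity-imported.csv -e fuu@cgiar.org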

  • - -
  • On CGSpace I will increase the RAM of the command line Java process for good luck before import…

    - +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
     $ dspace import -a -c 10568/104057 -e fuu@cgiar.org -m 2019-10-15-Bioversity.map -s 2019-10-15-Bioversity
    -
  • - -
  • After importing the 1,367 items I re-exported the metadata, changed the owning collections to those based on their type, then re-imported them

  • + - -

    2019-10-21

    - +

    2019-10-21

    - -

    2019-10-24

    - +

    2019-10-24

    - -

    2019-10-25

    - +

    2019-10-25

    - -

    2019-10-28

    - + + +

    2019-10-28

    - -

    2019-10-29

    - + + +

    2019-10-29

    +
  • Assist Maria from Bioversity with community and collection subscriptions
  • - - + diff --git a/docs/2019-11/index.html b/docs/2019-11/index.html index ed18942b2..52a042052 100644 --- a/docs/2019-11/index.html +++ b/docs/2019-11/index.html @@ -8,62 +8,54 @@ - + - + @@ -73,9 +65,9 @@ Let’s see how many of the REST API requests were for bitstreams (because t "@type": "BlogPosting", "headline": "November, 2019", "url": "https:\/\/alanorth.github.io\/cgspace-notes\/2019-11\/", - "wordCount": "3381", + "wordCount": "3457", "datePublished": "2019-11-04T12:20:30+02:00", - "dateModified": "2019-11-26T15:53:57+02:00", + "dateModified": "2019-11-27T14:56:00+02:00", "author": { "@type": "Person", "name": "Alan Orth" @@ -144,265 +136,226 @@ Let’s see how many of the REST API requests were for bitstreams (because t

    -

    2019-11-04

    - +

    2019-11-04

    # zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
     4671942
     # zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
     1277694
    -
    - - -
  • So 4.6 million from XMLUI and another 1.2 million from API requests

  • - -
  • Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):

    - +
    # zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
     1183456 
     # zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
     106781
    -
  • + - - - -
  • A bit later I checked Solr and found three requests from my IP with that user agent this month:

    - +
    $ http --print b 'http://localhost:8081/solr/statistics/select?q=ip:73.178.9.24+AND+userAgent:iskanie&fq=dateYearMonth%3A2019-11&rows=0'
     <?xml version="1.0" encoding="UTF-8"?>
     <response>
     <lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">ip:73.178.9.24 AND userAgent:iskanie</str><str name="fq">dateYearMonth:2019-11</str><str name="rows">0</str></lst></lst><result name="response" numFound="3" start="0"></result>
     </response>
    -
  • - -
  • Now I want to make similar requests with a user agent that is included in DSpace’s current user agent list:

    - +
    $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial"
     $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial"
     $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y' User-Agent:"celestial"
    -
  • - -
  • After twenty minutes I didn’t see any requests in Solr, so I assume they did not get logged because they matched a bot list…

    - +
    spider.agentregex.regexfile = ${dspace.dir}/config/spiders/Bots-2013-03.txt
    -
  • - - -
  • Apparently that is part of Atmire’s CUA, despite being in a standard DSpace configuration file…

  • - -
  • I tried with some other garbage user agents like “fuuuualan” and they were visible in Solr

    - +
    else if (line.hasOption('m'))
     {
    -SolrLogger.markRobotsByIP();
    +    SolrLogger.markRobotsByIP();
     }
    -
  • - - -
  • WTF again, there is actually a function called markRobotByUserAgent() that is never called anywhere!
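  • For context, that option parsing belongs to the dspace stats-util command; a sketch of the relevant flags (semantics assumed from the code above and the DSpace docs):

    $ dspace stats-util -m    # mark hits from known spider IPs with isBot:true
    $ dspace stats-util -i    # delete hits whose IP matches the spider files
    $ dspace stats-util -f    # delete hits already flagged with isBot:true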

    - + - -

    2019-11-05

    - +
  • + +

    2019-11-05

    $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"alanfuuu1"
     $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"alanfuuu2"
    -
    - -
  • After committing the changes in Solr I saw one request for “alanfuu1” and no requests for “alanfuu2”:

    - +
    $ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
     $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu1&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
    -<result name="response" numFound="1" start="0">
    +  <result name="response" numFound="1" start="0">
     $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu2&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
    -<result name="response" numFound="0" start="0"/>
    -
  • - -
  • So basically it seems like a win to update the example file with the latest one from Atmire’s COUNTER-Robots list

    - + <result name="response" numFound="0" start="0"/> + +
  • +
  • I'm curious how special character matching behaves in Solr, so I will test two requests: one with “www.gnip.com” which is in the spider list, and one with “www.gnyp.com” which isn't:
  • +
    $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnip.com"
     $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnyp.com"
    -
    - -
  • Then commit changes to Solr so we don’t have to wait:

    - +
    $ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
     $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnip.com&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound 
    -<result name="response" numFound="0" start="0"/>
    +  <result name="response" numFound="0" start="0"/>
     $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnyp.com&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
    -<result name="response" numFound="1" start="0">
    -
  • - -
  • So the blocking seems to be working because “www.gnip.com” is one of the new patterns added to the spiders file…

  • + <result name="response" numFound="1" start="0"> + - -

    2019-11-07

    - +

    2019-11-07

    +
  • I am reconsidering the move of cg.identifier.dataurl to cg.hasMetadata in CG Core v2 -
  • - -
  • Looking into CGSpace statistics again

    - + +
  • +
  • Looking into CGSpace statistics again +
  • +
    $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:BUbiNG*' | xmllint --format - | grep numFound
    -<result name="response" numFound="62944" start="0">
    -
    - -
  • Similar for com.plumanalytics, Grammarly, and ltx71!

    - + <result name="response" numFound="62944" start="0"> +
    $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:
     *com.plumanalytics*' | xmllint --format - | grep numFound
    -<result name="response" numFound="28256" start="0">
    +  <result name="response" numFound="28256" start="0">
     $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*Grammarly*' | xmllint --format - | grep numFound
    -<result name="response" numFound="6288" start="0">
    +  <result name="response" numFound="6288" start="0">
     $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*ltx71*' | xmllint --format - | grep numFound
    -<result name="response" numFound="105663" start="0">
    -
  • - - -
  • Deleting these seems to work, for example the 105,000 ltx71 records from 2018:

    - + <result name="response" numFound="105663" start="0"> +
    $ http --print b 'http://localhost:8081/solr/statistics-2018/update?stream.body=<delete><query>userAgent:*ltx71*</query><query>type:0</query></delete>&commit=true'
     $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*ltx71*' | xmllint --format - | grep numFound
    -<result name="response" numFound="0" start="0"/>
    -
  • - -
  • I wrote a quick bash script to check all these user agents against the CGSpace Solr statistics cores

    - + <result name="response" numFound="0" start="0"/> +
    $ curl -s 'http://localhost:8081/solr/statistics/select?facet=true&facet.field=dateYearMonth&facet.mincount=1&facet.offset=0&facet.limit=
     12&q=userAgent:*Unpaywall*' | xmllint --format - | less
     ...
    -<lst name="facet_counts">
    -<lst name="facet_queries"/>
    -<lst name="facet_fields">
    -<lst name="dateYearMonth">
    -<int name="2019-10">198624</int>
    -<int name="2019-05">88422</int>
    -<int name="2019-06">79911</int>
    -<int name="2019-09">67065</int>
    -<int name="2019-07">39026</int>
    -<int name="2019-08">36889</int>
    -<int name="2019-04">36512</int>
    -<int name="2019-11">760</int>
    -</lst>
    -</lst>
    -
  • - - -
  • That answers Peter’s question about why the stats jumped in October…

  • + <lst name="facet_counts"> + <lst name="facet_queries"/> + <lst name="facet_fields"> + <lst name="dateYearMonth"> + <int name="2019-10">198624</int> + <int name="2019-05">88422</int> + <int name="2019-06">79911</int> + <int name="2019-09">67065</int> + <int name="2019-07">39026</int> + <int name="2019-08">36889</int> + <int name="2019-04">36512</int> + <int name="2019-11">760</int> + </lst> + </lst> + - -

    2019-11-08

    - +

    2019-11-08

    +
  • I filed an issue on the COUNTER-Robots project to see if they agree to add User-Agent: to the list of robot user agents
  • - -

    2019-11-09

    - +

    2019-11-09

    +
  • Run all system updates on CGSpace and reboot the server -
  • - -
  • I did some work to clean up my bot processing script and removed about 2 million hits from the statistics cores on CGSpace

    - + +
  • +
  • I did some work to clean up my bot processing script and removed about 2 million hits from the statistics cores on CGSpace +
  • +
    $ for shard in statistics statistics-2018 statistics-2017 statistics-2016 statistics-2015 stat
     istics-2014 statistics-2013 statistics-2012 statistics-2011 statistics-2010; do ./check-spider-hits.sh -s $shard -p yes; done
    -
    - - -
  • Open a pull request against COUNTER-Robots to remove unnecessary escaping of dashes

  • + - -

    2019-11-12

    - +

    2019-11-12

    +
  • Also, while analysing this, I looked through some of the other top WLE items and fixed some metadata issues (adding dc.rights, fixing DOIs, adding ISSNs, etc) and noticed one issue with an item that has an Altmetric score for its Handle (lower) despite it having a correct DOI (with a higher score) -
  • - -

    2019-11-13

    - + + +

    2019-11-13

    + +
  • Testing modifying some of the COUNTER-Robots patterns to use [0-9] instead of \d digit character type, as Solr's regex search can't use those
  • +
    $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"Scrapoo/1"
     $ http "http://localhost:8081/solr/statistics/update?commit=true"
     $ http "http://localhost:8081/solr/statistics/select?q=userAgent:Scrapoo*" | xmllint --format - | grep numFound
    -<result name="response" numFound="1" start="0">
    +  <result name="response" numFound="1" start="0">
     $ http "http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/[0-9]/" | xmllint --format - | grep numFound
    -<result name="response" numFound="1" start="0">
    -
    - -
  • Nice, so searching with regex in Solr with // syntax works for those digits!

  • - -
  • I realized that it’s easier to search Solr from curl via POST using this syntax:

    - -
    $ curl -s "http://localhost:8081/solr/statistics/select" -d "q=userAgent:*Scrapoo*&rows=0")
    -
  • - -
  • If the parameters include something like “[0-9]” then curl interprets it as a range and will make ten requests

    - -
  • - -
  • I updated the check-spider-hits.sh script to use the POST syntax, and I’m evaluating the feasibility of including the regex search patterns from the spider agent file, as I had been filtering them out due to differences in PCRE and Solr regex syntax and issues with shell handling
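  • As an aside to the note above about curl treating “[0-9]” as a glob range: curl's URL globbing can also be disabled with -g / --globoff, so a GET with bracket expressions would work too (a sketch):

    $ curl -g -s 'http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/[0-9]/&rows=0'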

  • + <result name="response" numFound="1" start="0"> + - -

    2019-11-14

    - +
    $ curl -s "http://localhost:8081/solr/statistics/select" -d "q=userAgent:*Scrapoo*&rows=0")
    +
    +
    $ curl -s 'http://localhost:8081/solr/statistics/select' -d 'q=userAgent:/Postgenomic(\s|\+)v2/&rows=2'
    +
    +

    2019-11-14

    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/iwmi-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2019-11-14-combined-orcids.txt
     $ ./resolve-orcids.py -i /tmp/2019-11-14-combined-orcids.txt -o /tmp/2019-11-14-combined-names.txt -d
     # sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
     $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
    -
    - -
  • I created a pull request and merged them into the 5_x-prod branch

    - + +
  • +
  • Greatly improve my check-spider-hits.sh script to handle regular expressions in the spider agents patterns file
  • +
  • I've tested it quite a bit on DSpace Test, but I need to do a little more before I feel comfortable running the new code on CGSpace's Solr cores
  • - -

    2019-11-15

    - + + +

    2019-11-15

    +
  • Upon closer inspection, the plus signs seem to be getting misinterpreted somehow in the delete, but not in the select!
  • -
  • Plus signs are special in regular expressions, URLs, and Solr’s Lucene query parser, so I’m actually not sure where the issue is - +
  • Plus signs are special in regular expressions, URLs, and Solr's Lucene query parser, so I'm actually not sure where the issue is
  • +
  • I'm going to ignore regular expressions that have pluses for now
  • + +
  • I think I might also have to ignore patterns that have percent signs, like ^\%?default\%?$
  • After I added the ignores and did some more testing I finally ran the check-spider-hits.sh on all CGSpace Solr statistics cores and these are the number of hits purged from each core: -
  • -
  • That’s 1.4 million hits in addition to the 2 million I purged earlier this week…
  • -
  • For posterity, the major contributors to the hits on the statistics core were: - -
  • - -
  • Most of the curl hits were from CIAT in mid-2019, where they were using GuzzleHttp from PHP, which uses something like this for its user agent:

    - -
    Guzzle/<Guzzle_Version> curl/<curl_version> PHP/<PHP_VERSION>
    -
  • - -
  • Run system updates on DSpace Test and reboot the server

  • - -

    2019-11-17

    - + +
  • That's 1.4 million hits in addition to the 2 million I purged earlier this week…
  • +
  • For posterity, the major contributors to the hits on the statistics core were: +
  • +
  • Most of the curl hits were from CIAT in mid-2019, where they were using GuzzleHttp from PHP, which uses something like this for its user agent:
  • + +
    Guzzle/<Guzzle_Version> curl/<curl_version> PHP/<PHP_VERSION>
    +
    +

    2019-11-17

    + +
  • I finally decided to revert cg.hasMetadata back to cg.identifier.dataurl in my CG Core v2 branch (see #10)
  • Regarding the WLE item that has a much lower score than its DOI… -
  • + +
  • Finally deploy 5_x-cgcorev2 branch on DSpace Test
  • - -

    2019-11-18

    - +

    2019-11-18

    - -

    2019-11-19

    - +

    2019-11-19

  • Atmire merged my pull request regarding unnecessary escaping of dashes in regular expressions, as well as my suggestion of adding “User-Agent” to the list of patterns
  • I made another pull request to fix invalid escaping of one of their new patterns
  • I ran my check-spider-hits.sh script again with these new patterns and found a bunch more statistics requests that match, for example:
  • Buck is one I've never heard of before, its user agent is:

    Buck/2.2; (+https://app.hypefactors.com/media-monitoring/about.html)

  • All in all that's about 85,000 more hits purged, in addition to the 3.4 million I purged last week

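For context, purging an agent like that ultimately amounts to a count followed by a Solr delete-by-query on each statistics core; a rough sketch reusing the localhost:8081 core URL from earlier in these notes (the real check-spider-hits.sh wraps this differently):

    $ # count the Buck hits first (rows=0 just reports numFound)
    $ curl -s 'http://localhost:8081/solr/statistics/select' -d 'q=userAgent:/Buck.*/&rows=0'
    $ # then purge them via the update handler and commit
    $ curl -s 'http://localhost:8081/solr/statistics/update?commit=true' -H 'Content-Type: text/xml' --data-binary '<delete><query>userAgent:/Buck.*/</query></delete>'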

    2019-11-20

    2019-11-21

  • As for the larger features to focus on in the future ToRs:
  • We have a meeting about AReS future developments with Jane, Abenet, Peter, and Enrico tomorrow

    2019-11-22

    2019-11-24

    2019-11-25

    2019-11-26

    2019-11-27

  • File a ticket (242418) with Altmetric about DCTERMS migration to see if there is anything we need to be careful about
  • Make a pull request against cg-core schema to fix inconsistent references to cg.embargoDate (#13)
  • Review the AReS feedback again after Peter made some comments
  • I need to ask Marie-Angelique about the cg.peer-reviewed field

    2019-11-28

diff --git a/docs/404.html b/docs/404.html
index 7a552fb1f..838aa12be 100644
--- a/docs/404.html
+++ b/docs/404.html
diff --git a/docs/categories/index.html b/docs/categories/index.html
index 21fc36fba..d7c955723 100644
--- a/docs/categories/index.html
+++ b/docs/categories/index.html

    2019-11-04

  • Peter noticed that there were 5.2 million hits on CGSpace in 2019-10 according to the Atmire usage statistics
  • I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 million in the API logs:

    # zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
    4671942
    # zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
    1277694

  • So 4.6 million from XMLUI and another 1.2 million from API requests
  • Let's see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):

    # zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
    1183456
    # zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
    106781

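If a per-day breakdown of those totals were ever needed, the same logs can be grouped by date with the usual sort/uniq pipeline; a sketch along the lines of the commands above:

    # zcat --force /var/log/nginx/*access.log.*.gz | grep -oE "[0-9]{1,2}/Oct/2019" | sort | uniq -c | sort -n | tail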

    Possible changes to CGSpace metadata fields to align more with DC, QDC, and DCTERMS as well as CG Core v2.

    With reference to CG Core v2 draft standard by Marie-Angélique as well as DCMI DCTERMS.

    2019-10-01: Udana from IWMI asked me for a CSV export of their community on CGSpace. I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data. I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script's “unneccesary Unicode” fix: $ csvcut -c 'id,dc.

    2019-09-01

  • Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning
  • Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:

    # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
        440 17.58.101.255
        441 157.55.39.101
        485 207.46.13.43
        728 169.60.128.125
        730 207.46.13.108
        758 157.55.39.9
        808 66.160.140.179
        814 207.46.13.212
       2472 163.172.71.23
       6092 3.94.211.189
    # zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         33 2a01:7e00::f03c:91ff:fe16:fcb
         57 3.83.192.124
         57 3.87.77.25
         57 54.82.1.8
        822 2a01:9cc0:47:1:1a:4:0:2
       1223 45.5.184.72
       1633 172.104.229.92
       5112 205.186.128.185
       7249 2a01:7e00::f03c:91ff:fe18:7396
       9124 45.5.186.2

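To dig into any one of those IPs, the same logs can be grouped by requested path; a sketch using 45.5.186.2 from the list above (in the default combined log format the request path is field 7):

    # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep '45.5.186.2' | awk '{print $7}' | sort | uniq -c | sort -n | tail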

    2019-08-03

  • Look at Bioversity's latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name…

    2019-08-04

  • Deploy ORCID identifier updates requested by Bioversity to CGSpace
  • Run system updates on CGSpace (linode18) and reboot it
  • Before updating it I checked Solr and verified that all statistics cores were loaded properly…
  • After rebooting, all statistics cores were loaded… wow, that's lucky.
  • Run system updates on DSpace Test (linode19) and reboot it

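The notes don't say exactly how the cores were checked, but one way to spot-check that all statistics cores are loaded is Solr's CoreAdmin STATUS endpoint; a sketch assuming the same localhost Solr instance used elsewhere in these notes:

    $ curl -s 'http://localhost:8081/solr/admin/cores?action=STATUS&wt=json' | grep -o '"name":"statistics[^"]*"'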

    2019-07-01

  • Create an “AfricaRice books and book chapters” collection on CGSpace for AfricaRice
  • Last month Sisay asked why a “most popular” statistics link for a range of months in 2018 works for the CIAT community on DSpace Test, but not on CGSpace
  • Abenet had another similar issue a few days ago when trying to find the stats for 2018 in the RTB community

    2019-06-02

  • Merge the Solr filterCache and XMLUI ISI journal changes to the 5_x-prod branch and deploy on CGSpace
  • Run system updates on CGSpace (linode18) and reboot it

    2019-06-03

  • Skype with Marie-Angélique and Abenet about CG Core v2

    2019-05-01

  • The item seems to be in a pre-submitted state, so I tried to delete it from there:

    dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
    DELETE 1

  • But after this I tried to delete the item from the XMLUI and it is still present…

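Before running a DELETE like that it may be worth confirming which of the two tables the stuck item actually sits in; a small sketch using the same item_id (SELECT * because the exact columns vary by DSpace version):

    dspace=# SELECT * FROM workspaceitem WHERE item_id=74648;
    dspace=# SELECT * FROM workflowitem WHERE item_id=74648;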

    2019-04-01

  • There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today
  • I suspected that some might not be successful, because the stats show less, but today they were all HTTP 200!

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
       4432 200

  • In the last two weeks there have been 47,000 downloads of this same exact PDF by these three IP addresses
  • Apply country and region corrections and deletions on DSpace Test and CGSpace:

    $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
    $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
    $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
    $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d

    2019-03-01

  • Most worryingly, there are encoding errors in the abstracts for eleven items
  • I think I will need to ask Udana to re-copy and paste the abstracts with more care using Google Docs

diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html
index dc291b3d2..50a742bdb 100644
--- a/docs/categories/notes/index.html
+++ b/docs/categories/notes/index.html

  • Read more → diff --git a/docs/categories/notes/index.xml b/docs/categories/notes/index.xml index d006cc6a3..f8c9aea1b 100644 --- a/docs/categories/notes/index.xml +++ b/docs/categories/notes/index.xml @@ -17,31 +17,27 @@ Mon, 04 Nov 2019 12:20:30 +0200 https://alanorth.github.io/cgspace-notes/2019-11/ - <h2 id="2019-11-04">2019-11-04</h2> - + <h2 id="20191104">2019-11-04</h2> <ul> -<li><p>Peter noticed that there were 5.2 million hits on CGSpace in 2019-10 according to the Atmire usage statistics</p> - +<li>Peter noticed that there were 5.2 million hits on CGSpace in 2019-10 according to the Atmire usage statistics <ul> -<li><p>I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 million in the API logs:</p> - +<li>I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 million in the API logs:</li> +</ul> +</li> +</ul> <pre><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot; 4671942 # zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot; 1277694 -</code></pre></li> -</ul></li> - -<li><p>So 4.6 million from XMLUI and another 1.2 million from API requests</p></li> - -<li><p>Let&rsquo;s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</p> - +</code></pre><ul> +<li>So 4.6 million from XMLUI and another 1.2 million from API requests</li> +<li>Let's see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</li> +</ul> <pre><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &quot;[0-9]{1,2}/Oct/2019&quot; 1183456 # zcat --force /var/log/nginx/rest.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep -c -E &quot;/rest/bitstreams&quot; 106781 -</code></pre></li> -</ul> +</code></pre> @@ -51,7 +47,6 @@ https://alanorth.github.io/cgspace-notes/cgspace-cgcorev2-migration/ <p>Possible changes to CGSpace metadata fields to align more with DC, QDC, and DCTERMS as well as CG Core v2.</p> - <p>With reference to <a href="https://agriculturalsemantics.github.io/cg-core/cgcore.html">CG Core v2 draft standard</a> by Marie-Angélique as well as <a href="http://www.dublincore.org/specifications/dublin-core/dcmi-terms/">DCMI DCTERMS</a>.</p> @@ -61,8 +56,7 @@ Tue, 01 Oct 2019 13:20:51 +0300 https://alanorth.github.io/cgspace-notes/2019-10/ - 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace - I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script&rsquo;s &ldquo;unneccesary Unicode&rdquo; fix: + 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script's &ldquo;unneccesary Unicode&rdquo; fix: $ csvcut -c 'id,dc. 
@@ -71,37 +65,34 @@ Sun, 01 Sep 2019 10:17:51 +0300 https://alanorth.github.io/cgspace-notes/2019-09/ - <h2 id="2019-09-01">2019-09-01</h2> - + <h2 id="20190901">2019-09-01</h2> <ul> <li>Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning</li> - -<li><p>Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:</p> - +<li>Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:</li> +</ul> <pre><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 -440 17.58.101.255 -441 157.55.39.101 -485 207.46.13.43 -728 169.60.128.125 -730 207.46.13.108 -758 157.55.39.9 -808 66.160.140.179 -814 207.46.13.212 -2472 163.172.71.23 -6092 3.94.211.189 + 440 17.58.101.255 + 441 157.55.39.101 + 485 207.46.13.43 + 728 169.60.128.125 + 730 207.46.13.108 + 758 157.55.39.9 + 808 66.160.140.179 + 814 207.46.13.212 + 2472 163.172.71.23 + 6092 3.94.211.189 # zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 - 33 2a01:7e00::f03c:91ff:fe16:fcb - 57 3.83.192.124 - 57 3.87.77.25 - 57 54.82.1.8 -822 2a01:9cc0:47:1:1a:4:0:2 -1223 45.5.184.72 -1633 172.104.229.92 -5112 205.186.128.185 -7249 2a01:7e00::f03c:91ff:fe18:7396 -9124 45.5.186.2 -</code></pre></li> -</ul> + 33 2a01:7e00::f03c:91ff:fe16:fcb + 57 3.83.192.124 + 57 3.87.77.25 + 57 54.82.1.8 + 822 2a01:9cc0:47:1:1a:4:0:2 + 1223 45.5.184.72 + 1633 172.104.229.92 + 5112 205.186.128.185 + 7249 2a01:7e00::f03c:91ff:fe18:7396 + 9124 45.5.186.2 +</code></pre> @@ -110,22 +101,19 @@ Sat, 03 Aug 2019 12:39:51 +0300 https://alanorth.github.io/cgspace-notes/2019-08/ - <h2 id="2019-08-03">2019-08-03</h2> - + <h2 id="20190803">2019-08-03</h2> <ul> -<li>Look at Bioversity&rsquo;s latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name&hellip;</li> +<li>Look at Bioversity's latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name&hellip;</li> </ul> - -<h2 id="2019-08-04">2019-08-04</h2> - +<h2 id="20190804">2019-08-04</h2> <ul> <li>Deploy ORCID identifier updates requested by Bioversity to CGSpace</li> <li>Run system updates on CGSpace (linode18) and reboot it - <ul> <li>Before updating it I checked Solr and verified that all statistics cores were loaded properly&hellip;</li> -<li>After rebooting, all statistics cores were loaded&hellip; wow, that&rsquo;s lucky.</li> -</ul></li> +<li>After rebooting, all statistics cores were loaded&hellip; wow, that's lucky.</li> +</ul> +</li> <li>Run system updates on DSpace Test (linode19) and reboot it</li> </ul> @@ -136,16 +124,15 @@ Mon, 01 Jul 2019 12:13:51 +0300 https://alanorth.github.io/cgspace-notes/2019-07/ - <h2 id="2019-07-01">2019-07-01</h2> - + <h2 id="20190701">2019-07-01</h2> <ul> <li>Create an &ldquo;AfricaRice books and book chapters&rdquo; collection on CGSpace for AfricaRice</li> <li>Last month Sisay asked why the following &ldquo;most popular&rdquo; statistics link for a range of months in 2018 works for the CIAT community on DSpace Test, but not on CGSpace: - <ul> <li><a 
href="https://dspacetest.cgiar.org/handle/10568/35697/most-popular/item#simplefilter=custom&amp;time_filter_end_date=01%2F12%2F2018">DSpace Test</a></li> <li><a href="https://cgspace.cgiar.org/handle/10568/35697/most-popular/item#simplefilter=custom&amp;time_filter_end_date=01%2F12%2F2018">CGSpace</a></li> -</ul></li> +</ul> +</li> <li>Abenet had another similar issue a few days ago when trying to find the stats for 2018 in the RTB community</li> </ul> @@ -156,15 +143,12 @@ Sun, 02 Jun 2019 10:57:51 +0300 https://alanorth.github.io/cgspace-notes/2019-06/ - <h2 id="2019-06-02">2019-06-02</h2> - + <h2 id="20190602">2019-06-02</h2> <ul> <li>Merge the <a href="https://github.com/ilri/DSpace/pull/425">Solr filterCache</a> and <a href="https://github.com/ilri/DSpace/pull/426">XMLUI ISI journal</a> changes to the <code>5_x-prod</code> branch and deploy on CGSpace</li> <li>Run system updates on CGSpace (linode18) and reboot it</li> </ul> - -<h2 id="2019-06-03">2019-06-03</h2> - +<h2 id="20190603">2019-06-03</h2> <ul> <li>Skype with Marie-Angélique and Abenet about <a href="https://agriculturalsemantics.github.io/cg-core/cgcore.html">CG Core v2</a></li> </ul> @@ -176,24 +160,21 @@ Wed, 01 May 2019 07:37:43 +0300 https://alanorth.github.io/cgspace-notes/2019-05/ - <h2 id="2019-05-01">2019-05-01</h2> - + <h2 id="20190501">2019-05-01</h2> <ul> <li>Help CCAFS with regenerating some item thumbnails after they uploaded new PDFs to some items on CGSpace</li> <li>A user on the dspace-tech mailing list offered some suggestions for troubleshooting the problem with the inability to delete certain items - <ul> <li>Apparently if the item is in the <code>workflowitem</code> table it is submitted to a workflow</li> <li>And if it is in the <code>workspaceitem</code> table it is in the pre-submitted state</li> -</ul></li> - -<li><p>The item seems to be in a pre-submitted state, so I tried to delete it from there:</p> - +</ul> +</li> +<li>The item seems to be in a pre-submitted state, so I tried to delete it from there:</li> +</ul> <pre><code>dspace=# DELETE FROM workspaceitem WHERE item_id=74648; DELETE 1 -</code></pre></li> - -<li><p>But after this I tried to delete the item from the XMLUI and it is <em>still</em> present&hellip;</p></li> +</code></pre><ul> +<li>But after this I tried to delete the item from the XMLUI and it is <em>still</em> present&hellip;</li> </ul> @@ -203,35 +184,30 @@ DELETE 1 Mon, 01 Apr 2019 09:00:43 +0300 https://alanorth.github.io/cgspace-notes/2019-04/ - <h2 id="2019-04-01">2019-04-01</h2> - + <h2 id="20190401">2019-04-01</h2> <ul> <li>Meeting with AgroKnow to discuss CGSpace, ILRI data, AReS, GARDIAN, etc - <ul> <li>They asked if we had plans to enable RDF support in CGSpace</li> -</ul></li> - -<li><p>There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today</p> - +</ul> +</li> +<li>There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today <ul> -<li><p>I suspected that some might not be successful, because the stats show less, but today they were all HTTP 200!</p> - +<li>I suspected that some might not be successful, because the stats show less, but today they were all HTTP 200!</li> +</ul> +</li> +</ul> <pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5 -4432 200 -</code></pre></li> -</ul></li> - -<li><p>In the last two 
weeks there have been 47,000 downloads of this <em>same exact PDF</em> by these three IP addresses</p></li> - -<li><p>Apply country and region corrections and deletions on DSpace Test and CGSpace:</p> - + 4432 200 +</code></pre><ul> +<li>In the last two weeks there have been 47,000 downloads of this <em>same exact PDF</em> by these three IP addresses</li> +<li>Apply country and region corrections and deletions on DSpace Test and CGSpace:</li> +</ul> <pre><code>$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d -</code></pre></li> -</ul> +</code></pre> @@ -240,20 +216,19 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace Fri, 01 Mar 2019 12:16:30 +0100 https://alanorth.github.io/cgspace-notes/2019-03/ - <h2 id="2019-03-01">2019-03-01</h2> - + <h2 id="20190301">2019-03-01</h2> <ul> -<li>I checked IITA&rsquo;s 259 Feb 14 records from last month for duplicates using Atmire&rsquo;s Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good</li> +<li>I checked IITA's 259 Feb 14 records from last month for duplicates using Atmire's Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good</li> <li>I am now only waiting to hear from her about where the items should go, though I assume Journal Articles go to IITA Journal Articles collection, etc&hellip;</li> -<li>Looking at the other half of Udana&rsquo;s WLE records from 2018-11 - +<li>Looking at the other half of Udana's WLE records from 2018-11 <ul> <li>I finished the ones for Restoring Degraded Landscapes (RDL), but these are for Variability, Risks and Competing Uses (VRC)</li> <li>I did the usual cleanups for whitespace, added regions where they made sense for certain countries, cleaned up the DOI link formats, added rights information based on the publications page for a few items</li> <li>Most worryingly, there are encoding errors in the abstracts for eleven items, for example:</li> <li>68.15% � 9.45 instead of 68.15% ± 9.45</li> <li>2003�2013 instead of 2003–2013</li> -</ul></li> +</ul> +</li> <li>I think I will need to ask Udana to re-copy and paste the abstracts with more care using Google Docs</li> </ul> @@ -264,40 +239,34 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace Fri, 01 Feb 2019 21:37:30 +0200 https://alanorth.github.io/cgspace-notes/2019-02/ - <h2 id="2019-02-01">2019-02-01</h2> - + <h2 id="20190201">2019-02-01</h2> <ul> <li>Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!</li> - -<li><p>The top IPs before, during, and after this latest alert tonight were:</p> - +<li>The top IPs before, during, and after this latest alert tonight were:</li> +</ul> <pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;01/Feb/2019:(17|18|19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 -245 207.46.13.5 -332 
54.70.40.11 -385 5.143.231.38 -405 207.46.13.173 -405 207.46.13.75 -1117 66.249.66.219 -1121 35.237.175.180 -1546 5.9.6.51 -2474 45.5.186.2 -5490 85.25.237.71 -</code></pre></li> - -<li><p><code>85.25.237.71</code> is the &ldquo;Linguee Bot&rdquo; that I first saw last month</p></li> - -<li><p>The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase</p></li> - -<li><p>There were just over 3 million accesses in the nginx logs last month:</p> - + 245 207.46.13.5 + 332 54.70.40.11 + 385 5.143.231.38 + 405 207.46.13.173 + 405 207.46.13.75 + 1117 66.249.66.219 + 1121 35.237.175.180 + 1546 5.9.6.51 + 2474 45.5.186.2 + 5490 85.25.237.71 +</code></pre><ul> +<li><code>85.25.237.71</code> is the &ldquo;Linguee Bot&rdquo; that I first saw last month</li> +<li>The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase</li> +<li>There were just over 3 million accesses in the nginx logs last month:</li> +</ul> <pre><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2019&quot; 3018243 real 0m19.873s user 0m22.203s sys 0m1.979s -</code></pre></li> -</ul> +</code></pre> @@ -306,26 +275,23 @@ sys 0m1.979s Wed, 02 Jan 2019 09:48:30 +0200 https://alanorth.github.io/cgspace-notes/2019-01/ - <h2 id="2019-01-02">2019-01-02</h2> - + <h2 id="20190102">2019-01-02</h2> <ul> <li>Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning</li> - -<li><p>I don&rsquo;t see anything interesting in the web server logs around that time though:</p> - +<li>I don't see anything interesting in the web server logs around that time though:</li> +</ul> <pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 - 92 40.77.167.4 - 99 210.7.29.100 -120 38.126.157.45 -177 35.237.175.180 -177 40.77.167.32 -216 66.249.75.219 -225 18.203.76.93 -261 46.101.86.248 -357 207.46.13.1 -903 54.70.40.11 -</code></pre></li> -</ul> + 92 40.77.167.4 + 99 210.7.29.100 + 120 38.126.157.45 + 177 35.237.175.180 + 177 40.77.167.32 + 216 66.249.75.219 + 225 18.203.76.93 + 261 46.101.86.248 + 357 207.46.13.1 + 903 54.70.40.11 +</code></pre> @@ -334,16 +300,13 @@ sys 0m1.979s Sun, 02 Dec 2018 02:09:30 +0200 https://alanorth.github.io/cgspace-notes/2018-12/ - <h2 id="2018-12-01">2018-12-01</h2> - + <h2 id="20181201">2018-12-01</h2> <ul> <li>Switch CGSpace (linode18) to use OpenJDK instead of Oracle JDK</li> <li>I manually installed OpenJDK, then removed Oracle JDK, then re-ran the <a href="http://github.com/ilri/rmg-ansible-public">Ansible playbook</a> to update all configuration files, etc</li> <li>Then I ran all system updates and restarted the server</li> </ul> - -<h2 id="2018-12-02">2018-12-02</h2> - +<h2 id="20181202">2018-12-02</h2> <ul> <li>I noticed that there is another issue with PDF thumbnails on CGSpace, and I see there was another <a href="https://usn.ubuntu.com/3831-1/">Ghostscript vulnerability last week</a></li> </ul> @@ -355,15 +318,12 @@ sys 0m1.979s Thu, 01 Nov 2018 16:41:30 +0200 https://alanorth.github.io/cgspace-notes/2018-11/ - <h2 id="2018-11-01">2018-11-01</h2> - + <h2 id="20181101">2018-11-01</h2> <ul> <li>Finalize AReS Phase I and Phase II ToRs</li> <li>Send a note about my <a href="https://github.com/ilri/dspace-statistics-api">dspace-statistics-api</a> to the dspace-tech mailing list</li> </ul> - -<h2 
id="2018-11-03">2018-11-03</h2> - +<h2 id="20181103">2018-11-03</h2> <ul> <li>Linode has been sending mails a few times a day recently that CGSpace (linode18) has had high CPU usage</li> <li>Today these are the top 10 IPs:</li> @@ -376,11 +336,10 @@ sys 0m1.979s Mon, 01 Oct 2018 22:31:54 +0300 https://alanorth.github.io/cgspace-notes/2018-10/ - <h2 id="2018-10-01">2018-10-01</h2> - + <h2 id="20181001">2018-10-01</h2> <ul> <li>Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items</li> -<li>I created a GitHub issue to track this <a href="https://github.com/ilri/DSpace/issues/389">#389</a>, because I&rsquo;m super busy in Nairobi right now</li> +<li>I created a GitHub issue to track this <a href="https://github.com/ilri/DSpace/issues/389">#389</a>, because I'm super busy in Nairobi right now</li> </ul> @@ -390,13 +349,12 @@ sys 0m1.979s Sun, 02 Sep 2018 09:55:54 +0300 https://alanorth.github.io/cgspace-notes/2018-09/ - <h2 id="2018-09-02">2018-09-02</h2> - + <h2 id="20180902">2018-09-02</h2> <ul> <li>New <a href="https://jdbc.postgresql.org/documentation/changelog.html#version_42.2.5">PostgreSQL JDBC driver version 42.2.5</a></li> -<li>I&rsquo;ll update the DSpace role in our <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure playbooks</a> and run the updated playbooks on CGSpace and DSpace Test</li> -<li>Also, I&rsquo;ll re-run the <code>postgresql</code> tasks because the custom PostgreSQL variables are dynamic according to the system&rsquo;s RAM, and we never re-ran them after migrating to larger Linodes last month</li> -<li>I&rsquo;m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I&rsquo;m getting those autowire errors in Tomcat 8.5.30 again:</li> +<li>I'll update the DSpace role in our <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure playbooks</a> and run the updated playbooks on CGSpace and DSpace Test</li> +<li>Also, I'll re-run the <code>postgresql</code> tasks because the custom PostgreSQL variables are dynamic according to the system's RAM, and we never re-ran them after migrating to larger Linodes last month</li> +<li>I'm testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I'm getting those autowire errors in Tomcat 8.5.30 again:</li> </ul> @@ -406,27 +364,20 @@ sys 0m1.979s Wed, 01 Aug 2018 11:52:54 +0300 https://alanorth.github.io/cgspace-notes/2018-08/ - <h2 id="2018-08-01">2018-08-01</h2> - + <h2 id="20180801">2018-08-01</h2> <ul> -<li><p>DSpace Test had crashed at some point yesterday morning and I see the following in <code>dmesg</code>:</p> - +<li>DSpace Test had crashed at some point yesterday morning and I see the following in <code>dmesg</code>:</li> +</ul> <pre><code>[Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child [Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB [Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB -</code></pre></li> - -<li><p>Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight</p></li> - -<li><p>From the DSpace log I see that eventually Solr stopped responding, so I guess the <code>java</code> process that was OOM killed above was Tomcat&rsquo;s</p></li> - -<li><p>I&rsquo;m not sure why Tomcat didn&rsquo;t crash with an OutOfMemoryError&hellip;</p></li> - -<li><p>Anyways, 
perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core</p></li> - -<li><p>The server only has 8GB of RAM so we&rsquo;ll eventually need to upgrade to a larger one because we&rsquo;ll start starving the OS, PostgreSQL, and command line batch processes</p></li> - -<li><p>I ran all system updates on DSpace Test and rebooted it</p></li> +</code></pre><ul> +<li>Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight</li> +<li>From the DSpace log I see that eventually Solr stopped responding, so I guess the <code>java</code> process that was OOM killed above was Tomcat's</li> +<li>I'm not sure why Tomcat didn't crash with an OutOfMemoryError&hellip;</li> +<li>Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core</li> +<li>The server only has 8GB of RAM so we'll eventually need to upgrade to a larger one because we'll start starving the OS, PostgreSQL, and command line batch processes</li> +<li>I ran all system updates on DSpace Test and rebooted it</li> </ul> @@ -436,19 +387,16 @@ sys 0m1.979s Sun, 01 Jul 2018 12:56:54 +0300 https://alanorth.github.io/cgspace-notes/2018-07/ - <h2 id="2018-07-01">2018-07-01</h2> - + <h2 id="20180701">2018-07-01</h2> <ul> -<li><p>I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:</p> - +<li>I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:</li> +</ul> <pre><code>$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace -</code></pre></li> - -<li><p>During the <code>mvn package</code> stage on the 5.8 branch I kept getting issues with java running out of memory:</p> - +</code></pre><ul> +<li>During the <code>mvn package</code> stage on the 5.8 branch I kept getting issues with java running out of memory:</li> +</ul> <pre><code>There is insufficient memory for the Java Runtime Environment to continue. 
-</code></pre></li> -</ul> +</code></pre> @@ -457,32 +405,27 @@ sys 0m1.979s Mon, 04 Jun 2018 19:49:54 -0700 https://alanorth.github.io/cgspace-notes/2018-06/ - <h2 id="2018-06-04">2018-06-04</h2> - + <h2 id="20180604">2018-06-04</h2> <ul> <li>Test the <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560">DSpace 5.8 module upgrades from Atmire</a> (<a href="https://github.com/ilri/DSpace/pull/378">#378</a>) - <ul> -<li>There seems to be a problem with the CUA and L&amp;R versions in <code>pom.xml</code> because they are using SNAPSHOT and it doesn&rsquo;t build</li> -</ul></li> +<li>There seems to be a problem with the CUA and L&amp;R versions in <code>pom.xml</code> because they are using SNAPSHOT and it doesn't build</li> +</ul> +</li> <li>I added the new CCAFS Phase II Project Tag <code>PII-FP1_PACCA2</code> and merged it into the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/379">#379</a>)</li> - -<li><p>I proofed and tested the ILRI author corrections that Peter sent back to me this week:</p> - +<li>I proofed and tested the ILRI author corrections that Peter sent back to me this week:</li> +</ul> <pre><code>$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n -</code></pre></li> - -<li><p>I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in <a href="https://alanorth.github.io/cgspace-notes/cgspace-notes/2018-03/">March, 2018</a></p></li> - -<li><p>Time to index ~70,000 items on CGSpace:</p> - +</code></pre><ul> +<li>I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in <a href="https://alanorth.github.io/cgspace-notes/cgspace-notes/2018-03/">March, 2018</a></li> +<li>Time to index ~70,000 items on CGSpace:</li> +</ul> <pre><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b real 74m42.646s user 8m5.056s sys 2m7.289s -</code></pre></li> -</ul> +</code></pre> @@ -491,15 +434,14 @@ sys 2m7.289s Tue, 01 May 2018 16:43:54 +0300 https://alanorth.github.io/cgspace-notes/2018-05/ - <h2 id="2018-05-01">2018-05-01</h2> - + <h2 id="20180501">2018-05-01</h2> <ul> <li>I cleared the Solr statistics core on DSpace Test by issuing two commands directly to the Solr admin interface: - <ul> -<li><a href="http://localhost:3000/solr/statistics/update?stream.body=%3Cdelete%3E%3Cquery%3E*:*%3C/query%3E%3C/delete%3E">http://localhost:3000/solr/statistics/update?stream.body=%3Cdelete%3E%3Cquery%3E*:*%3C/query%3E%3C/delete%3E</a></li> -<li><a href="http://localhost:3000/solr/statistics/update?stream.body=%3Ccommit/%3E">http://localhost:3000/solr/statistics/update?stream.body=%3Ccommit/%3E</a></li> -</ul></li> +<li>http://localhost:3000/solr/statistics/update?stream.body=%3Cdelete%3E%3Cquery%3E*:*%3C/query%3E%3C/delete%3E</li> +<li>http://localhost:3000/solr/statistics/update?stream.body=%3Ccommit/%3E</li> +</ul> +</li> <li>Then I reduced the JVM heap size from 6144 back to 5120m</li> <li>Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked the <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure scripts</a> to support hosts choosing which distribution they want to use</li> </ul> @@ -511,10 +453,9 @@ sys 2m7.289s Sun, 01 Apr 2018 16:13:54 +0200 https://alanorth.github.io/cgspace-notes/2018-04/ - 
<h2 id="2018-04-01">2018-04-01</h2> - + <h2 id="20180401">2018-04-01</h2> <ul> -<li>I tried to test something on DSpace Test but noticed that it&rsquo;s down since god knows when</li> +<li>I tried to test something on DSpace Test but noticed that it's down since god knows when</li> <li>Catalina logs at least show some memory errors yesterday:</li> </ul> @@ -525,8 +466,7 @@ sys 2m7.289s Fri, 02 Mar 2018 16:07:54 +0200 https://alanorth.github.io/cgspace-notes/2018-03/ - <h2 id="2018-03-02">2018-03-02</h2> - + <h2 id="20180302">2018-03-02</h2> <ul> <li>Export a CSV of the IITA community metadata for Martin Mueller</li> </ul> @@ -538,13 +478,12 @@ sys 2m7.289s Thu, 01 Feb 2018 16:28:54 +0200 https://alanorth.github.io/cgspace-notes/2018-02/ - <h2 id="2018-02-01">2018-02-01</h2> - + <h2 id="20180201">2018-02-01</h2> <ul> <li>Peter gave feedback on the <code>dc.rights</code> proof of concept that I had sent him last week</li> -<li>We don&rsquo;t need to distinguish between internal and external works, so that makes it just a simple list</li> +<li>We don't need to distinguish between internal and external works, so that makes it just a simple list</li> <li>Yesterday I figured out how to monitor DSpace sessions using JMX</li> -<li>I copied the logic in the <code>jmx_tomcat_dbpools</code> provided by Ubuntu&rsquo;s <code>munin-plugins-java</code> package and used the stuff I discovered about JMX <a href="https://alanorth.github.io/cgspace-notes/cgspace-notes/2018-01/">in 2018-01</a></li> +<li>I copied the logic in the <code>jmx_tomcat_dbpools</code> provided by Ubuntu's <code>munin-plugins-java</code> package and used the stuff I discovered about JMX <a href="https://alanorth.github.io/cgspace-notes/cgspace-notes/2018-01/">in 2018-01</a></li> </ul> @@ -554,33 +493,26 @@ sys 2m7.289s Tue, 02 Jan 2018 08:35:54 -0800 https://alanorth.github.io/cgspace-notes/2018-01/ - <h2 id="2018-01-02">2018-01-02</h2> - + <h2 id="20180102">2018-01-02</h2> <ul> <li>Uptime Robot noticed that CGSpace went down and up a few times last night, for a few minutes each time</li> -<li>I didn&rsquo;t get any load alerts from Linode and the REST and XMLUI logs don&rsquo;t show anything out of the ordinary</li> +<li>I didn't get any load alerts from Linode and the REST and XMLUI logs don't show anything out of the ordinary</li> <li>The nginx logs show HTTP 200s until <code>02/Jan/2018:11:27:17 +0000</code> when Uptime Robot got an HTTP 500</li> <li>In dspace.log around that time I see many errors like &ldquo;Client closed the connection before file download was complete&rdquo;</li> - -<li><p>And just before that I see this:</p> - +<li>And just before that I see this:</li> +</ul> <pre><code>Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000]. -</code></pre></li> - -<li><p>Ah hah! So the pool was actually empty!</p></li> - -<li><p>I need to increase that, let&rsquo;s try to bump it up from 50 to 75</p></li> - -<li><p>After that one client got an HTTP 499 but then the rest were HTTP 200, so I don&rsquo;t know what the hell Uptime Robot saw</p></li> - -<li><p>I notice this error quite a few times in dspace.log:</p> - +</code></pre><ul> +<li>Ah hah! 
So the pool was actually empty!</li> +<li>I need to increase that, let's try to bump it up from 50 to 75</li> +<li>After that one client got an HTTP 499 but then the rest were HTTP 200, so I don't know what the hell Uptime Robot saw</li> +<li>I notice this error quite a few times in dspace.log:</li> +</ul> <pre><code>2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered &quot; &quot;]&quot; &quot;] &quot;&quot; at line 1, column 32. -</code></pre></li> - -<li><p>And there are many of these errors every day for the past month:</p> - +</code></pre><ul> +<li>And there are many of these errors every day for the past month:</li> +</ul> <pre><code>$ grep -c &quot;Error while searching for sidebar facets&quot; dspace.log.* dspace.log.2017-11-21:4 dspace.log.2017-11-22:1 @@ -625,9 +557,8 @@ dspace.log.2017-12-30:89 dspace.log.2017-12-31:53 dspace.log.2018-01-01:45 dspace.log.2018-01-02:34 -</code></pre></li> - -<li><p>Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let&rsquo;s Encrypt if it&rsquo;s just a handful of domains</p></li> +</code></pre><ul> +<li>Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let's Encrypt if it's just a handful of domains</li> </ul> @@ -637,8 +568,7 @@ dspace.log.2018-01-02:34 Fri, 01 Dec 2017 13:53:54 +0300 https://alanorth.github.io/cgspace-notes/2017-12/ - <h2 id="2017-12-01">2017-12-01</h2> - + <h2 id="20171201">2017-12-01</h2> <ul> <li>Uptime Robot noticed that CGSpace went down</li> <li>The logs say &ldquo;Timeout waiting for idle object&rdquo;</li> @@ -653,27 +583,22 @@ dspace.log.2018-01-02:34 Thu, 02 Nov 2017 09:37:54 +0200 https://alanorth.github.io/cgspace-notes/2017-11/ - <h2 id="2017-11-01">2017-11-01</h2> - + <h2 id="20171101">2017-11-01</h2> <ul> <li>The CORE developers responded to say they are looking into their bot not respecting our robots.txt</li> </ul> - -<h2 id="2017-11-02">2017-11-02</h2> - +<h2 id="20171102">2017-11-02</h2> <ul> -<li><p>Today there have been no hits by CORE and no alerts from Linode (coincidence?)</p> - +<li>Today there have been no hits by CORE and no alerts from Linode (coincidence?)</li> +</ul> <pre><code># grep -c &quot;CORE&quot; /var/log/nginx/access.log 0 -</code></pre></li> - -<li><p>Generate list of authors on CGSpace for Peter to go through and correct:</p> - +</code></pre><ul> +<li>Generate list of authors on CGSpace for Peter to go through and correct:</li> +</ul> <pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv; COPY 54701 -</code></pre></li> -</ul> +</code></pre> @@ -682,17 +607,14 @@ COPY 54701 Sun, 01 Oct 2017 08:07:54 +0300 https://alanorth.github.io/cgspace-notes/2017-10/ - <h2 id="2017-10-01">2017-10-01</h2> - + <h2 id="20171001">2017-10-01</h2> <ul> -<li><p>Peter emailed to point out that many items in the <a href="https://cgspace.cgiar.org/handle/10568/2703">ILRI archive collection</a> have multiple handles:</p> - +<li>Peter emailed to point out that many items in the <a 
href="https://cgspace.cgiar.org/handle/10568/2703">ILRI archive collection</a> have multiple handles:</li> +</ul> <pre><code>http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336 -</code></pre></li> - -<li><p>There appears to be a pattern but I&rsquo;ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine</p></li> - -<li><p>Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections</p></li> +</code></pre><ul> +<li>There appears to be a pattern but I'll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine</li> +<li>Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections</li> </ul> diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html index ad86a1781..07f6a576c 100644 --- a/docs/categories/notes/page/2/index.html +++ b/docs/categories/notes/page/2/index.html @@ -9,13 +9,12 @@ - - + @@ -85,40 +84,34 @@

  • diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html index 247e322cc..6a6930c86 100644 --- a/docs/categories/notes/page/3/index.html +++ b/docs/categories/notes/page/3/index.html @@ -9,13 +9,12 @@ - - + @@ -85,10 +84,9 @@

    -

    2018-04-01

    - +

    2018-04-01

      -
    • I tried to test something on DSpace Test but noticed that it’s down since god knows when
    • +
    • I tried to test something on DSpace Test but noticed that it's down since god knows when
    • Catalina logs at least show some memory errors yesterday:
    Read more → @@ -108,8 +106,7 @@

    -

    2018-03-02

    - +

    2018-03-02

    • Export a CSV of the IITA community metadata for Martin Mueller
    @@ -130,13 +127,12 @@

    -

    2018-02-01

    - +

    2018-02-01

    • Peter gave feedback on the dc.rights proof of concept that I had sent him last week
    • -
    • We don’t need to distinguish between internal and external works, so that makes it just a simple list
    • +
    • We don't need to distinguish between internal and external works, so that makes it just a simple list
    • Yesterday I figured out how to monitor DSpace sessions using JMX (a rough sketch of the JVM flags is below)
    • -
    • I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu’s munin-plugins-java package and used the stuff I discovered about JMX in 2018-01
    • +
    • I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu's munin-plugins-java package and used the stuff I discovered about JMX in 2018-01
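For reference, exposing Tomcat over JMX mostly comes down to a few standard JVM flags; a minimal sketch with placeholder values (port 9010 and the disabled authentication are assumptions, not our production settings):

  CATALINA_OPTS="$CATALINA_OPTS -Dcom.sun.management.jmxremote \
    -Dcom.sun.management.jmxremote.port=9010 \
    -Dcom.sun.management.jmxremote.local.only=true \
    -Dcom.sun.management.jmxremote.authenticate=false \
    -Dcom.sun.management.jmxremote.ssl=false"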
    Read more → @@ -155,33 +151,26 @@

    -

    2018-01-02

    - +

    2018-01-02

    • Uptime Robot noticed that CGSpace went down and up a few times last night, for a few minutes each time
    • -
    • I didn’t get any load alerts from Linode and the REST and XMLUI logs don’t show anything out of the ordinary
    • +
    • I didn't get any load alerts from Linode and the REST and XMLUI logs don't show anything out of the ordinary
    • The nginx logs show HTTP 200s until 02/Jan/2018:11:27:17 +0000 when Uptime Robot got an HTTP 500
    • In dspace.log around that time I see many errors like “Client closed the connection before file download was complete”
    • - -
    • And just before that I see this:

      - +
    • And just before that I see this:
    • +
    Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
    -
    - -
  • Ah hah! So the pool was actually empty!

  • - -
  • I need to increase that, let’s try to bump it up from 50 to 75

  • - -
  • After that one client got an HTTP 499 but then the rest were HTTP 200, so I don’t know what the hell Uptime Robot saw

  • - -
  • I notice this error quite a few times in dspace.log:

    - +
      +
    • Ah hah! So the pool was actually empty!
    • +
    • I need to increase that, let's try to bump it up from 50 to 75 (see the sketch at the end of this excerpt)
    • +
    • After that one client got an HTTP 499 but then the rest were HTTP 200, so I don't know what the hell Uptime Robot saw
    • +
    • I notice this error quite a few times in dspace.log:
    • +
    2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
     org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered " "]" "] "" at line 1, column 32.
    -
  • - -
  • And there are many of these errors every day for the past month:

    - +
      +
    • And there are many of these errors every day for the past month:
    • +
    $ grep -c "Error while searching for sidebar facets" dspace.log.*
     dspace.log.2017-11-21:4
     dspace.log.2017-11-22:1
    @@ -226,9 +215,8 @@ dspace.log.2017-12-30:89
     dspace.log.2017-12-31:53
     dspace.log.2018-01-01:45
     dspace.log.2018-01-02:34
    -
  • - -
  • Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let’s Encrypt if it’s just a handful of domains

  • +
      +
    • Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let's Encrypt if it's just a handful of domains
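A minimal sketch of that pool bump, assuming the DSpace database pool is defined as a JNDI resource in Tomcat's server.xml using the Tomcat JDBC pool (attribute names are from the Tomcat documentation; the real definition lives in our Ansible templates and the password here is a placeholder):

  <Resource name="jdbc/dspace" auth="Container" type="javax.sql.DataSource"
            factory="org.apache.tomcat.jdbc.pool.DataSourceFactory"
            driverClassName="org.postgresql.Driver"
            url="jdbc:postgresql://localhost:5432/dspace"
            username="dspace" password="fuuu"
            maxActive="75" maxIdle="20" maxWait="5000" />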
    Read more → @@ -247,8 +235,7 @@ dspace.log.2018-01-02:34

    -

    2017-12-01

    - +

    2017-12-01

    • Uptime Robot noticed that CGSpace went down
    • The logs say “Timeout waiting for idle object”
    • @@ -272,27 +259,22 @@ dspace.log.2018-01-02:34

      -

      2017-11-01

      - +

      2017-11-01

      • The CORE developers responded to say they are looking into their bot not respecting our robots.txt
      - -

      2017-11-02

      - +

      2017-11-02

        -
      • Today there have been no hits by CORE and no alerts from Linode (coincidence?)

        - +
      • Today there have been no hits by CORE and no alerts from Linode (coincidence?)
      • +
      # grep -c "CORE" /var/log/nginx/access.log
       0
      -
      - -
    • Generate list of authors on CGSpace for Peter to go through and correct:

      - +
        +
      • Generate list of authors on CGSpace for Peter to go through and correct:
      • +
      dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
       COPY 54701
      -
    • -
    + Read more → @@ -310,17 +292,14 @@ COPY 54701

    -

    2017-10-01

    - +

    2017-10-01

    http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
    -
    - -
  • There appears to be a pattern but I’ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine

  • - -
  • Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections

  • +
      +
    • There appears to be a pattern but I'll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
    • +
    • Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
    Read more → diff --git a/docs/categories/page/2/index.html b/docs/categories/page/2/index.html index 6a9d3c5ea..f8aac8fb8 100644 --- a/docs/categories/page/2/index.html +++ b/docs/categories/page/2/index.html @@ -9,13 +9,12 @@ - - + @@ -100,40 +99,34 @@

    -

    2019-02-01

    - +

    2019-02-01

    • Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!
    • - -
    • The top IPs before, during, and after this latest alert tonight were:

      - +
    • The top IPs before, during, and after this latest alert tonight were:
    • +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "01/Feb/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    245 207.46.13.5
    332 54.70.40.11
    385 5.143.231.38
    405 207.46.13.173
    405 207.46.13.75
   1117 66.249.66.219
   1121 35.237.175.180
   1546 5.9.6.51
   2474 45.5.186.2
   5490 85.25.237.71
    • 85.25.237.71 is the “Linguee Bot” that I first saw last month
    • +
    • The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase
    • +
    • There were just over 3 million accesses in the nginx logs last month:
    • +
    # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
     3018243
     
     real    0m19.873s
     user    0m22.203s
     sys     0m1.979s
    -
  • - + Read more → @@ -151,26 +144,23 @@ sys 0m1.979s

    -

    2019-01-02

    - +

    2019-01-02

    • Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning
    • I don't see anything interesting in the web server logs around that time though:
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +     92 40.77.167.4
    +     99 210.7.29.100
    +    120 38.126.157.45
    +    177 35.237.175.180
    +    177 40.77.167.32
    +    216 66.249.75.219
    +    225 18.203.76.93
    +    261 46.101.86.248
    +    357 207.46.13.1
    +    903 54.70.40.11
    +
    Read more → @@ -188,16 +178,13 @@ sys 0m1.979s

    -

    2018-12-01

    - +

    2018-12-01

    • Switch CGSpace (linode18) to use OpenJDK instead of Oracle JDK
    • I manually installed OpenJDK, then removed Oracle JDK, then re-ran the Ansible playbook to update all configuration files, etc
    • Then I ran all system updates and restarted the server
    - -

    2018-12-02

    - +

    2018-12-02

    @@ -218,15 +205,12 @@ sys 0m1.979s

    -

    2018-11-01

    - +

    2018-11-01

    • Finalize AReS Phase I and Phase II ToRs
    • Send a note about my dspace-statistics-api to the dspace-tech mailing list
    - -

    2018-11-03

    - +

    2018-11-03

    • Linode has been sending mails a few times a day recently that CGSpace (linode18) has had high CPU usage
    • Today these are the top 10 IPs:
    • @@ -248,11 +232,10 @@ sys 0m1.979s

      -

      2018-10-01

      - +

      2018-10-01

      • Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items
      • -
      • I created a GitHub issue to track this #389, because I’m super busy in Nairobi right now
      • +
      • I created a GitHub issue to track this #389, because I'm super busy in Nairobi right now
      Read more → @@ -271,13 +254,12 @@ sys 0m1.979s

      -

      2018-09-02

      - +

      2018-09-02

      • New PostgreSQL JDBC driver version 42.2.5
      • -
      • I’ll update the DSpace role in our Ansible infrastructure playbooks and run the updated playbooks on CGSpace and DSpace Test
      • -
      • Also, I’ll re-run the postgresql tasks because the custom PostgreSQL variables are dynamic according to the system’s RAM, and we never re-ran them after migrating to larger Linodes last month
      • -
      • I’m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I’m getting those autowire errors in Tomcat 8.5.30 again:
      • +
      • I'll update the DSpace role in our Ansible infrastructure playbooks and run the updated playbooks on CGSpace and DSpace Test
      • +
      • Also, I'll re-run the postgresql tasks because the custom PostgreSQL variables are dynamic according to the system's RAM, and we never re-ran them after migrating to larger Linodes last month
      • +
      • I'm testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I'm getting those autowire errors in Tomcat 8.5.30 again:
      Read more → @@ -296,27 +278,20 @@ sys 0m1.979s

      -

      2018-08-01

      - +

      2018-08-01

        -
      • DSpace Test had crashed at some point yesterday morning and I see the following in dmesg:

        - +
      • DSpace Test had crashed at some point yesterday morning and I see the following in dmesg:
      • +
      [Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
       [Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB
       [Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      -
      - -
    • Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight

    • - -
    • From the DSpace log I see that eventually Solr stopped responding, so I guess the java process that was OOM killed above was Tomcat’s

    • - -
    • I’m not sure why Tomcat didn’t crash with an OutOfMemoryError…

    • - -
    • Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core

    • - -
    • The server only has 8GB of RAM so we’ll eventually need to upgrade to a larger one because we’ll start starving the OS, PostgreSQL, and command line batch processes

    • - -
    • I ran all system updates on DSpace Test and rebooted it

    • +
        +
      • Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight
      • +
      • From the DSpace log I see that eventually Solr stopped responding, so I guess the java process that was OOM killed above was Tomcat's
      • +
      • I'm not sure why Tomcat didn't crash with an OutOfMemoryError…
      • +
      • Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core
      • +
      • The server only has 8GB of RAM so we'll eventually need to upgrade to a larger one because we'll start starving the OS, PostgreSQL, and command line batch processes
      • +
      • I ran all system updates on DSpace Test and rebooted it
      Read more → @@ -335,19 +310,16 @@ sys 0m1.979s

      -

      2018-07-01

      - +

      2018-07-01

        -
      • I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:

        - -
        $ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
        -
      • - -
      • During the mvn package stage on the 5.8 branch I kept getting issues with java running out of memory:

        - -
        There is insufficient memory for the Java Runtime Environment to continue.
        -
      • +
      • I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:
      +
      $ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
      +
        +
      • During the mvn package stage on the 5.8 branch I kept getting issues with java running out of memory:
      • +
      +
      There is insufficient memory for the Java Runtime Environment to continue.
      +
      Read more → @@ -365,32 +337,27 @@ sys 0m1.979s

      -

      2018-06-04

      - +

      2018-06-04

      • Test the DSpace 5.8 module upgrades from Atmire (#378) -
          -
        • There seems to be a problem with the CUA and L&R versions in pom.xml because they are using SNAPSHOT and it doesn’t build
        • -
      • +
      • There seems to be a problem with the CUA and L&R versions in pom.xml because they are using SNAPSHOT and it doesn't build
      • +
      +
    • I added the new CCAFS Phase II Project Tag PII-FP1_PACCA2 and merged it into the 5_x-prod branch (#379)
    • - -
    • I proofed and tested the ILRI author corrections that Peter sent back to me this week:

      - +
    • I proofed and tested the ILRI author corrections that Peter sent back to me this week:
    • +
    $ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
    -
    - -
  • I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in March, 2018

  • - -
  • Time to index ~70,000 items on CGSpace:

    - +
      +
    • I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in March, 2018
    • +
    • Time to index ~70,000 items on CGSpace:
    • +
    $ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b                                  
     
     real    74m42.646s
     user    8m5.056s
     sys     2m7.289s
    -
  • - + Read more → @@ -408,15 +375,14 @@ sys 2m7.289s

    -

    2018-05-01

    - +

    2018-05-01

    +
  • Then I reduced the JVM heap size from 6144 back to 5120m
  • Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked the Ansible infrastructure scripts to support hosts choosing which distribution they want to use
  • diff --git a/docs/categories/page/3/index.html b/docs/categories/page/3/index.html index 2d2f5c71d..dddffcd70 100644 --- a/docs/categories/page/3/index.html +++ b/docs/categories/page/3/index.html @@ -9,13 +9,12 @@ - - + @@ -100,10 +99,9 @@

    -

    2018-04-01

    - +

    2018-04-01

      -
    • I tried to test something on DSpace Test but noticed that it’s down since god knows when
    • +
    • I tried to test something on DSpace Test but noticed that it's down since god knows when
    • Catalina logs at least show some memory errors yesterday:
    Read more → @@ -123,8 +121,7 @@

    -

    2018-03-02

    - +

    2018-03-02

    • Export a CSV of the IITA community metadata for Martin Mueller
    @@ -145,13 +142,12 @@

    -

    2018-02-01

    - +

    2018-02-01

    • Peter gave feedback on the dc.rights proof of concept that I had sent him last week
    • -
    • We don’t need to distinguish between internal and external works, so that makes it just a simple list
    • +
    • We don't need to distinguish between internal and external works, so that makes it just a simple list
    • Yesterday I figured out how to monitor DSpace sessions using JMX
    • -
    • I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu’s munin-plugins-java package and used the stuff I discovered about JMX in 2018-01
    • +
    • I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu's munin-plugins-java package and used the stuff I discovered about JMX in 2018-01
    Read more → @@ -170,33 +166,26 @@

    -

    2018-01-02

    - +

    2018-01-02

    • Uptime Robot noticed that CGSpace went down and up a few times last night, for a few minutes each time
    • -
    • I didn’t get any load alerts from Linode and the REST and XMLUI logs don’t show anything out of the ordinary
    • +
    • I didn't get any load alerts from Linode and the REST and XMLUI logs don't show anything out of the ordinary
    • The nginx logs show HTTP 200s until 02/Jan/2018:11:27:17 +0000 when Uptime Robot got an HTTP 500
    • In dspace.log around that time I see many errors like “Client closed the connection before file download was complete”
    • - -
    • And just before that I see this:

      - +
    • And just before that I see this:
    • +
    Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
    -
    - -
  • Ah hah! So the pool was actually empty!

  • - -
  • I need to increase that, let’s try to bump it up from 50 to 75

  • - -
  • After that one client got an HTTP 499 but then the rest were HTTP 200, so I don’t know what the hell Uptime Robot saw

  • - -
  • I notice this error quite a few times in dspace.log:

    - +
      +
    • Ah hah! So the pool was actually empty!
    • +
    • I need to increase that, let's try to bump it up from 50 to 75
    • +
    • After that one client got an HTTP 499 but then the rest were HTTP 200, so I don't know what the hell Uptime Robot saw
    • +
    • I notice this error quite a few times in dspace.log:
    • +
    2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
     org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered " "]" "] "" at line 1, column 32.
    -
  • - -
  • And there are many of these errors every day for the past month:

    - +
      +
    • And there are many of these errors every day for the past month:
    • +
    $ grep -c "Error while searching for sidebar facets" dspace.log.*
     dspace.log.2017-11-21:4
     dspace.log.2017-11-22:1
    @@ -241,9 +230,8 @@ dspace.log.2017-12-30:89
     dspace.log.2017-12-31:53
     dspace.log.2018-01-01:45
     dspace.log.2018-01-02:34
    -
  • - -
  • Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let’s Encrypt if it’s just a handful of domains

  • +
      +
    • Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let's Encrypt if it's just a handful of domains
    Read more → @@ -262,8 +250,7 @@ dspace.log.2018-01-02:34

    -

    2017-12-01

    - +

    2017-12-01

    • Uptime Robot noticed that CGSpace went down
    • The logs say “Timeout waiting for idle object”
    • @@ -287,27 +274,22 @@ dspace.log.2018-01-02:34

      -

      2017-11-01

      - +

      2017-11-01

      • The CORE developers responded to say they are looking into their bot not respecting our robots.txt
      - -

      2017-11-02

      - +

      2017-11-02

        -
      • Today there have been no hits by CORE and no alerts from Linode (coincidence?)

        - +
      • Today there have been no hits by CORE and no alerts from Linode (coincidence?)
      • +
      # grep -c "CORE" /var/log/nginx/access.log
       0
      -
      - -
    • Generate list of authors on CGSpace for Peter to go through and correct:

      - +
        +
      • Generate list of authors on CGSpace for Peter to go through and correct:
      • +
      dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
       COPY 54701
      -
    • -
    + Read more → @@ -325,17 +307,14 @@ COPY 54701

    -

    2017-10-01

    - +

    2017-10-01

    http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
    -
    - -
  • There appears to be a pattern but I’ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine

  • - -
  • Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections

  • +
      +
    • There appears to be a pattern but I'll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
    • +
    • Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
    Read more → @@ -374,16 +353,13 @@ COPY 54701

    -

    2017-09-06

    - +

    2017-09-06

    • Linode sent an alert that CGSpace (linode18) was using 261% CPU for the past two hours
    - -

    2017-09-07

    - +

    2017-09-07

      -
    • Ask Sisay to clean up the WLE approvers a bit, as Marianne’s user account is both in the approvers step as well as the group
    • +
    • Ask Sisay to clean up the WLE approvers a bit, as Marianne's user account is both in the approvers step as well as the group
    Read more → @@ -402,22 +378,21 @@ COPY 54701

    -

    2017-08-01

    - +

    2017-08-01

    • Linode sent an alert that CGSpace (linode18) was using 350% CPU for the past two hours
    • I looked in the Activity pane of the Admin Control Panel and it seems that Google, Baidu, Yahoo, and Bing are all crawling with massive numbers of bots concurrently (~100 total, mostly Baidu and Google)
    • The good thing is that, according to dspace.log.2017-08-01, they are all using the same Tomcat session
    • This means our Tomcat Crawler Session Valve is working
    • But many of the bots are browsing dynamic URLs like: -
      • /handle/10568/3353/discover
      • /handle/10568/16510/browse
      • -
    • +
    +
  • The robots.txt only blocks the top-level /discover and /browse URLs… we will need to find a way to forbid them from accessing these!
  • Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
  • -
  • It turns out that we’re already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
  • +
  • It turns out that we're already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
  • Also, the bot has to successfully browse the page first so it can receive the HTTP header…
  • We might actually have to block these requests with HTTP 403 depending on the user agent (rough nginx sketch below)
  • Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415
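A rough sketch of how that HTTP 403 blocking might look in nginx — purely illustrative, with made-up bot names in the regex and a placeholder upstream name, not the config we actually deployed:

  # block known crawlers from dynamic Discovery pages under /handle
  location ~ ^/handle/[0-9]+/[0-9]+/(discover|browse) {
      if ($http_user_agent ~* (baiduspider|googlebot|bingbot|yandex)) {
          return 403;
      }
      proxy_pass http://tomcat_http; # placeholder upstream name
  }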
  • diff --git a/docs/categories/page/4/index.html b/docs/categories/page/4/index.html index 2cd251657..b7f945afd 100644 --- a/docs/categories/page/4/index.html +++ b/docs/categories/page/4/index.html @@ -9,13 +9,12 @@ - - + @@ -100,18 +99,15 @@

    -

    2017-07-01

    - +

    2017-07-01

    • Run system updates and reboot DSpace Test
    - -

    2017-07-04

    - +

    2017-07-04

    • Merge changes for WLE Phase II theme rename (#329)
    • -
    • Looking at extracting the metadata registries from ICARDA’s MEL DSpace database so we can compare fields with CGSpace
    • -
    • We can use PostgreSQL’s extended output format (-x) plus sed to format the output into quasi XML:
    • +
    • Looking at extracting the metadata registries from ICARDA's MEL DSpace database so we can compare fields with CGSpace
    • +
    • We can use PostgreSQL's extended output format (-x) plus sed to format the output into quasi XML:
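A minimal sketch of the psql side of that (the table and columns are standard DSpace 5; the sed post-processing is left out here because it was very much ad hoc):

  $ psql -x -U dspace -h localhost dspace -c 'SELECT metadata_schema_id, element, qualifier, scope_note FROM metadatafieldregistry;'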
    Read more → @@ -130,7 +126,7 @@

    - 2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we’ll create a new sub-community for Phase II and create collections for the research themes there The current “Research Themes” community will be renamed to “WLE Phase I Research Themes” Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg. + 2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we'll create a new sub-community for Phase II and create collections for the research themes there The current “Research Themes” community will be renamed to “WLE Phase I Research Themes” Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg. Read more → @@ -148,7 +144,7 @@

    - 2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it’s a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire’s CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace. + 2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it's a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire's CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace. Read more → @@ -166,23 +162,18 @@

    -

    2017-04-02

    - +

    2017-04-02

    • Merge one change to CCAFS flagships that I had forgotten to remove last month (“MANAGING CLIMATE RISK”): https://github.com/ilri/DSpace/pull/317
    • Quick proof-of-concept hack to add dc.rights to the input form, including some inline instructions/hints:
    - -

    dc.rights in the submission form

    - +

    dc.rights in the submission form

    • Remove redundant/duplicate text in the DSpace submission license
    • - -
    • Testing the CMYK patch on a collection with 650 items:

      - -
      $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt
      -
    • +
    • Testing the CMYK patch on a collection with 650 items:
    +
    $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt
    +
    Read more → @@ -200,14 +191,11 @@

    -

    2017-03-01

    - +

    2017-03-01

    • Run the 279 CIAT author corrections on CGSpace
    - -

    2017-03-02

    - +

    2017-03-02

    • Skype with Michael and Peter, discussing moving the CGIAR Library to CGSpace
    • CGIAR people possibly open to moving content, redirecting library.cgiar.org to CGSpace and letting CGSpace resolve their handles
    • @@ -217,13 +205,11 @@
    • Also, need to consider talking to Atmire about hiring them to bring ORCiD metadata to REST / OAI
    • Filed an issue on DSpace issue tracker for the filter-media bug that causes it to process JPGs even when limiting to the PDF thumbnail plugin: DS-3516
    • Discovered that the ImageMagic filter-media plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK
    • - -
    • Interestingly, it seems DSpace 4.x’s thumbnails were sRGB, but forcing regeneration using DSpace 5.x’s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 1056851999):

      - +
    • Interestingly, it seems DSpace 4.x's thumbnails were sRGB, but forcing regeneration using DSpace 5.x's ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 10568/51999):
    • +
    $ identify ~/Desktop/alc_contrastes_desafios.jpg
     /Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
    -
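For a one-off check or fix, ImageMagick can convert a CMYK thumbnail to sRGB by hand — this is just a manual test, not what the filter-media plugin does internally:

  $ convert ~/Desktop/alc_contrastes_desafios.jpg -colorspace sRGB /tmp/alc_contrastes_desafios-srgb.jpg
  $ identify /tmp/alc_contrastes_desafios-srgb.jpg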
    - + Read more → @@ -241,25 +227,22 @@

    -

    2017-02-07

    - +

    2017-02-07

      -
    • An item was mapped twice erroneously again, so I had to remove one of the mappings manually:

      - +
    • An item was mapped twice erroneously again, so I had to remove one of the mappings manually:
    • +
    dspace=# select * from collection2item where item_id = '80278';
  id   | collection_id | item_id
-------+---------------+---------
 92551 |           313 |   80278
 92550 |           313 |   80278
 90774 |          1051 |   80278
     (3 rows)
     dspace=# delete from collection2item where id = 92551 and item_id = 80278;
     DELETE 1
    -
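A query that would surface these double mappings proactively — not something I ran at the time, just the obvious GROUP BY:

  dspace=# select item_id, collection_id, count(*) from collection2item group by item_id, collection_id having count(*) > 1;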
    - -
  • Create issue on GitHub to track the addition of CCAFS Phase II project tags (#301)

  • - -
  • Looks like we’ll be using cg.identifier.ccafsprojectpii as the field name

  • +
      +
    • Create issue on GitHub to track the addition of CCAFS Phase II project tags (#301)
    • +
    • Looks like we'll be using cg.identifier.ccafsprojectpii as the field name
    Read more → @@ -278,12 +261,11 @@ DELETE 1

    -

    2017-01-02

    - +

    2017-01-02

    • I checked to see if the Solr sharding task that is supposed to run on January 1st had run and saw there was an error
    • -
    • I tested on DSpace Test as well and it doesn’t work there either
    • -
    • I asked on the dspace-tech mailing list because it seems to be broken, and actually now I’m not sure if we’ve ever had the sharding task run successfully over all these years
    • +
    • I tested on DSpace Test as well and it doesn't work there either
    • +
    • I asked on the dspace-tech mailing list because it seems to be broken, and actually now I'm not sure if we've ever had the sharding task run successfully over all these years
    Read more → @@ -302,25 +284,20 @@ DELETE 1

    -

    2016-12-02

    - +

    2016-12-02

    • CGSpace was down for five hours in the morning while I was sleeping
    • - -
    • While looking in the logs for errors, I see tons of warnings about Atmire MQM:

      - +
    • While looking in the logs for errors, I see tons of warnings about Atmire MQM:
    • +
    2016-12-02 03:00:32,352 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail="dc.title", transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail="THUMBNAIL", transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
    -
    - -
  • I see thousands of them in the logs for the last few months, so it’s not related to the DSpace 5.5 upgrade

  • - -
  • I’ve raised a ticket with Atmire to ask

  • - -
  • Another worrying error from dspace.log is:

  • +
      +
    • I see thousands of them in the logs for the last few months, so it's not related to the DSpace 5.5 upgrade
    • +
    • I've raised a ticket with Atmire to ask
    • +
    • Another worrying error from dspace.log is:
    Read more → @@ -339,13 +316,11 @@ DELETE 1

    -

    2016-11-01

    - +

    2016-11-01

      -
    • Add dc.type to the output options for Atmire’s Listings and Reports module (#286)
    • +
    • Add dc.type to the output options for Atmire's Listings and Reports module (#286)
    - -

    Listings and Reports with output type

    +

    Listings and Reports with output type

    Read more → @@ -363,22 +338,19 @@ DELETE 1

    -

    2016-10-03

    - +

    2016-10-03

    • Testing adding ORCIDs to a CSV file for a single item to see if the author orders get messed up
    • Need to test the following scenarios to see how author order is affected: -
      • ORCIDs only
      • ORCIDs plus normal authors
      • -
  • I exported a random item's metadata as CSV, deleted all columns except id and collection, and made a new column called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:
  0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
    Read more → diff --git a/docs/categories/page/5/index.html b/docs/categories/page/5/index.html index 8fb66ec00..ff4aa079d 100644 --- a/docs/categories/page/5/index.html +++ b/docs/categories/page/5/index.html @@ -9,13 +9,12 @@ - - + @@ -100,18 +99,15 @@

    -

    2016-09-01

    - +

    2016-09-01

    • Discuss helping CCAFS with some batch tagging of ORCID IDs for their authors
    • -
    • Discuss how the migration of CGIAR’s Active Directory to a flat structure will break our LDAP groups in DSpace
    • +
    • Discuss how the migration of CGIAR's Active Directory to a flat structure will break our LDAP groups in DSpace
    • We had been using DC=ILRI to determine whether a user was ILRI or not
    • - -
    • It looks like we might be able to use OUs now, instead of DCs:

      - -
      $ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=org" -D "admigration1@cgiarad.org" -W "(sAMAccountName=admigration1)"
      -
    • +
    • It looks like we might be able to use OUs now, instead of DCs:
    +
    $ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=org" -D "admigration1@cgiarad.org" -W "(sAMAccountName=admigration1)"
    +
    Read more → @@ -129,22 +125,19 @@

    -

    2016-08-01

    - +

    2016-08-01

    • Add updated distribution license from Sisay (#259)
    • Play with upgrading Mirage 2 dependencies in bower.json because most are several versions of out date
    • Bootstrap is at 3.3.0 but upstream is at 3.3.7, and upgrading to anything beyond 3.3.1 breaks glyphicons and probably more
    • bower stuff is a dead end, waste of time, too many issues
    • Anything after Bootstrap 3.3.1 makes glyphicons disappear (HTTP 404 trying to access from incorrect path of fonts)
    • - -
    • Start working on DSpace 5.1 → 5.5 port:

      - +
    • Start working on DSpace 5.1 → 5.5 port:
    • +
    $ git checkout -b 55new 5_x-prod
     $ git reset --hard ilri/5_x-prod
     $ git rebase -i dspace-5.5
    -
    - + Read more → @@ -162,22 +155,19 @@ $ git rebase -i dspace-5.5

    -

    2016-07-01

    - +

    2016-07-01

    • Add dc.description.sponsorship to Discovery sidebar facets and make investors clickable in item view (#232)
    • - -
    • I think this query should find and replace all authors that have “,” at the end of their names:

      - +
    • I think this query should find and replace all authors that have “,” at the end of their names:
    • +
    dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
     UPDATE 95
     dspacetest=# select text_value from  metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
 text_value
     ------------
     (0 rows)
    -
    - -
  • In this case the select query was showing 95 results before the update

  • +
      +
    • In this case the select query was showing 95 results before the update
    Read more → @@ -196,11 +186,10 @@ text_value

    -

    2016-06-01

    - +

    2016-06-01

    + Read more → @@ -252,13 +238,12 @@ text_value

    -

    2016-04-04

    - +

    2016-04-04

    • Looking at log file use on CGSpace and notice that we need to work on our cron setup a bit
    • We are backing up all logs in the log folder, including useless stuff like solr, cocoon, handle-plugin, etc
    • -
    • After running DSpace for over five years I’ve never needed to look in any other log file than dspace.log, leave alone one from last year!
    • -
    • This will save us a few gigs of backup space we’re paying for on S3
    • +
    • After running DSpace for over five years I've never needed to look in any other log file than dspace.log, leave alone one from last year!
    • +
    • This will save us a few gigs of backup space we're paying for on S3
    • Also, I noticed the checker log has some errors we should pay attention to:
    Read more → @@ -278,11 +263,10 @@ text_value

    -

    2016-03-02

    - +

    2016-03-02

    • Looking at issues with author authorities on CGSpace
    • -
    • For some reason we still have the index-lucene-update cron job active on CGSpace, but I’m pretty sure we don’t need it as of the latest few versions of Atmire’s Listings and Reports module
    • +
    • For some reason we still have the index-lucene-update cron job active on CGSpace, but I'm pretty sure we don't need it as of the latest few versions of Atmire's Listings and Reports module
    • Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match environment on CGSpace server
    Read more → @@ -302,16 +286,13 @@ text_value

    -

    2016-02-05

    - +

    2016-02-05

    • Looking at some DAGRIS data for Abenet Yabowork
    • Lots of issues with spaces, newlines, etc causing the import to fail
    • I noticed we have a very interesting list of countries on CGSpace:
    - -

    CGSpace country list

    - +

    CGSpace country list

    • Not only are there 49,000 countries, we have some blanks (25)…
    • Also, lots of things like “COTE D`LVOIRE” and “COTE D IVOIRE”
    • @@ -333,8 +314,7 @@ text_value

      -

      2016-01-13

      - +

      2016-01-13

      • Move ILRI collection 10568/12503 from 10568/27869 to 10568/27629 using the move_collections.sh script I wrote last year.
      • I realized it is only necessary to clear the Cocoon cache after moving collections—rather than reindexing—as no metadata has changed, and therefore no search or browse indexes need to be updated.
      • @@ -357,18 +337,16 @@ text_value

        -

        2015-12-02

        - +

        2015-12-02

          -
        • Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less space:

          - +
        • Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less space:
        • +
        # cd /home/dspacetest.cgiar.org/log
         # ls -lh dspace.log.2015-11-18*
         -rw-rw-r-- 1 tomcat7 tomcat7 2.0M Nov 18 23:59 dspace.log.2015-11-18
         -rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo
         -rw-rw-r-- 1 tomcat7 tomcat7 169K Nov 18 23:59 dspace.log.2015-11-18.xz
        -
        -
      + Read more → diff --git a/docs/categories/page/6/index.html b/docs/categories/page/6/index.html index f9c407132..925a1f413 100644 --- a/docs/categories/page/6/index.html +++ b/docs/categories/page/6/index.html @@ -9,13 +9,12 @@ - - + @@ -100,18 +99,15 @@

      -

      2015-11-22

      - +

      2015-11-22

      • CGSpace went down
      • Looks like DSpace exhausted its PostgreSQL connection pool
      • - -
      • Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:

        - +
      • Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:
      • +
      $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
       78
      -
      -
    + Read more → diff --git a/docs/cgiar-library-migration/index.html b/docs/cgiar-library-migration/index.html index ebfe927b6..d97360fba 100644 --- a/docs/cgiar-library-migration/index.html +++ b/docs/cgiar-library-migration/index.html @@ -15,7 +15,7 @@ - + @@ -25,7 +25,7 @@ "@type": "BlogPosting", "headline": "CGIAR Library Migration", "url": "https:\/\/alanorth.github.io\/cgspace-notes\/cgiar-library-migration\/", - "wordCount": "1285", + "wordCount": "1278", "datePublished": "2017-09-18T16:38:35+03:00", "dateModified": "2019-10-28T13:40:20+02:00", "author": { @@ -100,47 +100,38 @@

    Rough notes for importing the CGIAR Library content. It was decided that this content would go to a new top-level community called CGIAR System Organization.

    Pre-migration Technical TODOs

    Things that need to happen before the migration:

      • Temporarily disable nightly index-discovery cron job because the import process will be taking place during some of this time and I don't want them to be competing to update the Solr index
      • Copy HTTPS certificate key pair from CGIAR Library server's Tomcat keystore:
    $ keytool -list -keystore tomcat.keystore
     $ keytool -importkeystore -srckeystore tomcat.keystore -destkeystore library.cgiar.org.p12 -deststoretype PKCS12 -srcalias tomcat
     $ openssl pkcs12 -in library.cgiar.org.p12 -nokeys -out library.cgiar.org.crt.pem
     $ openssl pkcs12 -in library.cgiar.org.p12 -nodes -nocerts -out library.cgiar.org.key.pem
     $ wget https://certs.godaddy.com/repository/gdroot-g2.crt https://certs.godaddy.com/repository/gdig2.crt.pem
     $ cat library.cgiar.org.crt.pem gdig2.crt.pem > library.cgiar.org-chained.pem
    -
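Since the Handle.net admins later complained about mismatched public and private keys, it is worth sanity-checking that the exported certificate and key actually belong together — a quick check that was not part of the original TODO list (the two digests should be identical):

  $ openssl x509 -noout -modulus -in library.cgiar.org.crt.pem | openssl md5
  $ openssl rsa -noout -modulus -in library.cgiar.org.key.pem | openssl md5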
    Migration Process

    Export all top-level communities and collections from DSpace Test:

    -
    $ export PATH=$PATH:/home/dspacetest.cgiar.org/bin
     $ dspace packager -d -a -t AIP -e aorth@mjanja.ch -i 10947/2515 10947-2515/10947-2515.zip
     $ dspace packager -d -a -t AIP -e aorth@mjanja.ch -i 10947/2516 10947-2516/10947-2516.zip
    @@ -154,21 +145,16 @@ $ dspace packager -d -a -t AIP -e aorth@mjanja.ch -i 10947/2527 10947-2527/10947
     $ dspace packager -d -a -t AIP -e aorth@mjanja.ch -i 10568/93759 10568-93759/10568-93759.zip
     $ dspace packager -d -a -t AIP -e aorth@mjanja.ch -i 10568/93760 10568-93760/10568-93760.zip
     $ dspace packager -d -a -t AIP -e aorth@mjanja.ch -i 10947/1 10947-1/10947-1.zip
    -
    Import to CGSpace (also see notes from 2017-05-10):

      • Copy all exports from DSpace Test
      • Add ingestion overrides to dspace.cfg before import:
      mets.dspaceAIP.ingest.crosswalk.METSRIGHTS = NIL
       mets.dspaceAIP.ingest.crosswalk.DSPACE-ROLES = NIL
      -
    • - -
    • [x] Import communities and collections, paying attention to options to skip missing parents and ignore handles:

      - +
        +
      • Import communities and collections, paying attention to options to skip missing parents and ignore handles:
      • +
      $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx3072m -XX:-UseGCOverheadLimit -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
       $ export PATH=$PATH:/home/cgspace.cgiar.org/bin
       $ dspace packager -r -u -a -t AIP -o skipIfParentMissing=true -e aorth@mjanja.ch -p 10568/83389 10947-2515/10947-2515.zip
      @@ -185,65 +171,45 @@ $ for item in 10947-2527/ITEM@10947-*; do dspace packager -r -f -u -t AIP -e aor
       $ dspace packager -s -t AIP -o ignoreHandle=false -e aorth@mjanja.ch -p 10568/83389 10947-1/10947-1.zip
       $ for collection in 10947-1/COLLECTION@10947-*; do dspace packager -s -o ignoreHandle=false -t AIP -e aorth@mjanja.ch -p 10947/1 $collection; done
       $ for item in 10947-1/ITEM@10947-*; do dspace packager -r -f -u -t AIP -e aorth@mjanja.ch $item; done
      -
    • -
    - -


    This submits AIP hierarchies recursively (-r) and suppresses errors when an item's parent collection hasn't been created yet—for example, if the item is mapped. The large historic archive (10947/1) is created in several steps because it requires a lot of memory and often crashes.

    Create new subcommunities and collections for content we reorganized into new hierarchies from the original:

      • Create CGIAR System Management Board sub-community: 10568/83536
        • Content from CGIAR System Management Board documents collection (10947/4561) goes here
        • Import collection hierarchy first and then the items:
        $ dspace packager -r -t AIP -o ignoreHandle=false -e aorth@mjanja.ch -p 10568/83536 10568-93760/COLLECTION@10947-4651.zip
        $ for item in 10568-93760/ITEM@10947-465*; do dspace packager -r -f -u -t AIP -e aorth@mjanja.ch $item; done
      • Create CGIAR System Management Office sub-community: 10568/83537
        • Create CGIAR System Management Office documents collection: 10568/83538
        • Import items to collection individually in replace mode (-r) while explicitly preserving handles and ignoring parents:
        $ for item in 10568-93759/ITEM@10947-46*; do dspace packager -r -t AIP -o ignoreHandle=false -o ignoreParent=true -e aorth@mjanja.ch -p 10568/83538 $item; done

    Get the handles for the last few items from CGIAR Library that were created since we did the migration to DSpace Test in May:

    +
    dspace=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id in (select item_id from metadatavalue where metadata_field_id=11 and date(text_value) > '2017-05-01T00:00:00Z');
    +
      +
    • Export them from the CGIAR Library:
    • +
    +
    # for handle in 10947/4658 10947/4659 10947/4660 10947/4661 10947/4665 10947/4664 10947/4666 10947/4669; do /usr/local/dspace/bin/dspace packager -d -a -t AIP -e m.marus@cgiar.org -i $handle ${handle}.zip; done
    +
      +
    • Import on CGSpace:
    • +
    +
    $ for item in 10947-latest/*.zip; do dspace packager -r -u -t AIP -e aorth@mjanja.ch $item; done
    +

    Post Migration

      • Shut down Tomcat and run update-sequences.sql as the system's postgres user
      • Remove ingestion overrides from dspace.cfg
      • Reset PostgreSQL max_connections to 183
      • Enable nightly index-discovery cron job
      • Adjust CGSpace's handle-server/config.dct to add the new prefix alongside our existing 10568, ie:
      "server_admins" = (
       "300:0.NA/10568"
       "300:0.NA/10947"
      @@ -258,54 +224,33 @@ $ for item in 10568-93760/ITEM@10947-465*; do dspace packager -r -f -u -t AIP -e
       "300:0.NA/10568"
       "300:0.NA/10947"
       )
      -
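The handle server has to be restarted to pick up config.dct changes; a sketch assuming it was started with DSpace's stock script (how it is actually supervised on CGSpace may differ), after stopping the running handle server process:

  $ /home/cgspace.cgiar.org/bin/start-handle-server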
    • -
    - -

    I had regenerated the sitebndl.zip file on the CGIAR Library server and sent it to the Handle.net admins, but they said that there were mismatches between the public and private keys, which I suspect is due to make-handle-config not being very flexible. After discussing our scenario with the Handle.net admins they said we actually don't need to send an updated sitebndl.zip for this type of change, and the above config.dct edits are all that is required. I guess they just did something on their end by setting the authoritative IP address for the 10947 prefix to be the same as ours…

      • Re-deploy DSpace from freshly built 5_x-prod branch
      • Merge cgiar-library branch to master and re-run ansible nginx templates
      • Run system updates and reboot server
      • Switch to Let's Encrypt HTTPS certificates (after DNS is updated and server isn't busy):
    $ sudo systemctl stop nginx
     $ /opt/certbot-auto certonly --standalone -d library.cgiar.org
     $ sudo systemctl start nginx
    -
    - - -

Troubleshooting

    Foreign Key Error in dspace cleanup


    The cleanup script is sometimes used during import processes to clean the database and assetstore after failed AIP imports. If you see the following error with dspace cleanup -v:

    Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"                                                                                                                       
       Detail: Key (bitstream_id)=(119841) is still referenced from table "bundle".

    The solution is to set the primary_bitstream_id to NULL in PostgreSQL:

    dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (119841);
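To double-check which bundle still references that bitstream before and after the update, a quick query against the bundle table works (a sketch, using the bitstream ID from the error above):

dspace=# SELECT bundle_id, primary_bitstream_id FROM bundle WHERE primary_bitstream_id = 119841;

Once that returns zero rows, re-running dspace cleanup -v should get past this bitstream.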

    PSQLException During AIP Ingest

    After a few rounds of ingesting—possibly with failures—you might end up with inconsistent IDs in the database. In this case, during AIP ingest of a single collection in submit mode (-s):

    org.dspace.content.packager.PackageValidationException: Exception while ingesting 10947-2527/10947-2527.zip, Reason: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint "handle_pkey"                                    
       Detail: Key (handle_id)=(86227) already exists.

    The normal solution is to run the update-sequences.sql script (with Tomcat shut down) but it doesn't seem to work in this case. Finding the maximum handle_id and manually updating the sequence seems to work:

dspace=# select * from handle where handle_id=(select max(handle_id) from handle);
dspace=# select setval('handle_seq',86873);
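The two statements can also be collapsed into one, which avoids copying the maximum ID by hand (a sketch; adjust the sequence name if yours differs):

dspace=# SELECT setval('handle_seq', (SELECT MAX(handle_id) FROM handle));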
     
diff --git a/docs/cgspace-cgcorev2-migration/index.html b/docs/cgspace-cgcorev2-migration/index.html

    Possible changes to CGSpace metadata fields to align more with DC, QDC, and DCTERMS as well as CG Core v2.

With reference to CG Core v2 draft standard by Marie-Angélique as well as DCMI DCTERMS.

Proposed Changes

As of 2019-11-17 the scope of the changes includes the following fields:
• cg.creator.id→cg.creator.identifier
  • ORCID identifiers
• dc.format.extent→dcterms.extent
• dc.date.issued→dcterms.issued
• dc.description.abstract→dcterms.abstract
• dc.description→dcterms.description
• dc.description.sponsorship→cg.contributor.donor
  • values from CrossRef or Grid.ac if possible
• dc.description.version→cg.peer-reviewed
• cg.fulltextstatus→cg.howpublished
  • CGSpace uses values like “Formally Published” or “Grey Literature”
• dc.identifier.citation→dcterms.bibliographicCitation
• cg.identifier.status→dcterms.accessRights
  • current values are “Open Access” and “Limited Access”
  • future values are possibly “Open” and “Restricted”?
• dc.language.iso→dcterms.language
  • current values are ISO 639-1 (aka Alpha 2)
  • future values are possibly ISO 639-3 (aka Alpha 3)?
• cg.link.reference→dcterms.relation
• dc.publisher→dcterms.publisher
• dc.relation.ispartofseries→dcterms.isPartOf
• dc.rights→dcterms.license
• dc.source→cg.journal
• dc.subject→dcterms.subject
• dc.type→dcterms.type
• dc.identifier.isbn→cg.isbn
• dc.identifier.issn→cg.issn
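For most of these renames the database side is just re-pointing existing values at the new entry in the metadata field registry, along these lines (a sketch only; the metadata_field_id values 27 and 249 are hypothetical placeholders, not the real IDs on CGSpace):

dspace=# -- example: move dc.description.abstract values to dcterms.abstract (field IDs are hypothetical)
dspace=# UPDATE metadatavalue SET metadata_field_id=249 WHERE metadata_field_id=27 AND resource_type_id=2;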

    The following fields are currently out of the scope of this migration because they are used internally by DSpace 5.x/6.x and would be difficult to change without significant modifications to the core of the code:

• dc.title (IncludePageMeta.java only considers DC when building pageMeta, which we rely on in XMLUI because of XSLT from DRI)
• dc.title.alternative
• dc.description.provenance
• dc.contributor.author (IncludePageMeta.java only considers DC when building pageMeta, which we rely on in XMLUI because of XSLT from DRI)

    Fields to Create

Make sure the following fields exist:

• cg.creator.identifier (242)
• cg.contributor.donor (243)
• cg.peer-reviewed (244)
• cg.howpublished (245)
• cg.journal (246)
• cg.isbn (247)
• cg.issn (248)
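One way to confirm those fields exist with the expected IDs is to query the metadata field registry directly (a sketch; the ID range matches the list above):

dspace=# SELECT metadata_field_id, element, qualifier FROM metadatafieldregistry WHERE metadata_field_id BETWEEN 242 AND 248;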

      Fields to delete

Fields to delete after migration:

• cg.creator.id
• cg.fulltextstatus
• cg.identifier.status
• cg.link.reference
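Before deleting a field it is worth confirming that no values reference it any more, for example (a sketch for cg.fulltextstatus; repeat per field):

dspace=# SELECT COUNT(*) FROM metadatavalue WHERE metadata_field_id=(SELECT metadata_field_id FROM metadatafieldregistry WHERE element='fulltextstatus');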

        Implementation Progress


        Tally of the status of the implementation of the new fields in the CGSpace 5_x-cgcorev2 branch.

The new fields tracked in the tally are cg.creator.identifier, dcterms.extent, dcterms.issued, dcterms.abstract, dcterms.description, cg.contributor.donor, cg.peer-reviewed, cg.howpublished, dcterms.bibliographicCitation, dcterms.accessRights, dcterms.language, dcterms.relation, dcterms.publisher, dcterms.isPartOf, dcterms.license, cg.journal, dcterms.subject, dcterms.type, cg.isbn, and cg.issn, with one of the status columns covering the crosswalks.

        There are a few things that I need to check once I get a deployment of this code up and running:

• Assess the XSL changes to see if things like [not(@qualifier)] still make sense after we move fields from DC to DCTERMS, as some fields will no longer have qualifiers
        • Do I need to edit crosswalks that we are not using, like MODS?
        • There is potentially a lot of work in the OAI metadata formats like DIM, METS, and QDC (see dspace/config/crosswalks/oai/*.xsl)

        ¹ Not committed yet because I don't want to have to make minor adjustments in multiple commits. Re-apply the gauntlet of fixes with the sed script:

        $ find dspace/modules/xmlui-mirage2/src/main/webapp/themes -iname "*.xsl" -exec sed -i -f ./cgcore-xsl-replacements.sed {} \;
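The sed script itself is just a list of substitution expressions, one per field rename; a hypothetical entry (not the actual file contents) would look like:

# cgcore-xsl-replacements.sed (hypothetical example entry)
s/dc\.description\.abstract/dcterms.abstract/g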
         
diff --git a/docs/index.html b/docs/index.html

2019-11-04

• Peter noticed that there were 5.2 million hits on CGSpace in 2019-10 according to the Atmire usage statistics
  • I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 million in the API logs:

# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
4671942
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
1277694

• So 4.6 million from XMLUI and another 1.2 million from API requests
• Let's see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):

# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
1183456
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
106781

    Possible changes to CGSpace metadata fields to align more with DC, QDC, and DCTERMS as well as CG Core v2.

With reference to CG Core v2 draft standard by Marie-Angélique as well as DCMI DCTERMS.

2019-10-01: Udana from IWMI asked me for a CSV export of their community on CGSpace. I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data. I will limit the scope to the titles, regions, subregions, and river basins for now, to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script's “unneccesary Unicode” fix: $ csvcut -c 'id,dc.

2019-09-01

• Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning
• Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:

# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    440 17.58.101.255
    441 157.55.39.101
    485 207.46.13.43
    728 169.60.128.125
    730 207.46.13.108
    758 157.55.39.9
    808 66.160.140.179
    814 207.46.13.212
   2472 163.172.71.23
   6092 3.94.211.189
# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
     33 2a01:7e00::f03c:91ff:fe16:fcb
     57 3.83.192.124
     57 3.87.77.25
     57 54.82.1.8
    822 2a01:9cc0:47:1:1a:4:0:2
   1223 45.5.184.72
   1633 172.104.229.92
   5112 205.186.128.185
   7249 2a01:7e00::f03c:91ff:fe18:7396
   9124 45.5.186.2

2019-08-03

• Look at Bioversity's latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name…

2019-08-04

• Deploy ORCID identifier updates requested by Bioversity to CGSpace
• Run system updates on CGSpace (linode18) and reboot it
  • Before updating it I checked Solr and verified that all statistics cores were loaded properly…
  • After rebooting, all statistics cores were loaded… wow, that's lucky.
• Run system updates on DSpace Test (linode19) and reboot it

2019-07-01

• Create an “AfricaRice books and book chapters” collection on CGSpace for AfricaRice
• Last month Sisay asked why the following “most popular” statistics link for a range of months in 2018 works for the CIAT community on DSpace Test, but not on CGSpace:
  • DSpace Test: https://dspacetest.cgiar.org/handle/10568/35697/most-popular/item#simplefilter=custom&time_filter_end_date=01%2F12%2F2018
  • CGSpace: https://cgspace.cgiar.org/handle/10568/35697/most-popular/item#simplefilter=custom&time_filter_end_date=01%2F12%2F2018
• Abenet had another similar issue a few days ago when trying to find the stats for 2018 in the RTB community

2019-06-02

• Merge the Solr filterCache and XMLUI ISI journal changes to the 5_x-prod branch and deploy on CGSpace
• Run system updates on CGSpace (linode18) and reboot it

2019-06-03

• Skype with Marie-Angélique and Abenet about CG Core v2
    @@ -324,24 +307,21 @@

2019-05-01

• Help CCAFS with regenerating some item thumbnails after they uploaded new PDFs to some items on CGSpace
• A user on the dspace-tech mailing list offered some suggestions for troubleshooting the problem with the inability to delete certain items
  • Apparently if the item is in the workflowitem table it is submitted to a workflow
  • And if it is in the workspaceitem table it is in the pre-submitted state
• The item seems to be in a pre-submitted state, so I tried to delete it from there:

dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
DELETE 1

• But after this I tried to delete the item from the XMLUI and it is still present…
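A quick way to see which of the two tables an item is actually sitting in before deleting anything is to check both by item ID (a sketch using the same item as above):

dspace=# SELECT workflow_id FROM workflowitem WHERE item_id=74648;
dspace=# SELECT workspace_item_id FROM workspaceitem WHERE item_id=74648;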
    Read more → @@ -360,35 +340,30 @@ DELETE 1

2019-04-01

• Meeting with AgroKnow to discuss CGSpace, ILRI data, AReS, GARDIAN, etc
  • They asked if we had plans to enable RDF support in CGSpace
• There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today
  • I suspected that some might not be successful, because the stats show less, but today they were all HTTP 200!

# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
   4432 200

• In the last two weeks there have been 47,000 downloads of this same exact PDF by these three IP addresses
• Apply country and region corrections and deletions on DSpace Test and CGSpace:

$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d

2019-03-01

• I checked IITA's 259 Feb 14 records from last month for duplicates using Atmire's Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good
• I am now only waiting to hear from her about where the items should go, though I assume Journal Articles go to IITA Journal Articles collection, etc…
• Looking at the other half of Udana's WLE records from 2018-11
  • I finished the ones for Restoring Degraded Landscapes (RDL), but these are for Variability, Risks and Competing Uses (VRC)
  • I did the usual cleanups for whitespace, added regions where they made sense for certain countries, cleaned up the DOI link formats, added rights information based on the publications page for a few items
  • Most worryingly, there are encoding errors in the abstracts for eleven items, for example:
  • 68.15% � 9.45 instead of 68.15% ± 9.45
  • 2003�2013 instead of 2003–2013
• I think I will need to ask Udana to re-copy and paste the abstracts with more care using Google Docs
  • Read more → diff --git a/docs/index.xml b/docs/index.xml index 624a0faf6..46d3ea18f 100644 --- a/docs/index.xml +++ b/docs/index.xml @@ -17,31 +17,27 @@ Mon, 04 Nov 2019 12:20:30 +0200 https://alanorth.github.io/cgspace-notes/2019-11/ - <h2 id="2019-11-04">2019-11-04</h2> - + <h2 id="20191104">2019-11-04</h2> <ul> -<li><p>Peter noticed that there were 5.2 million hits on CGSpace in 2019-10 according to the Atmire usage statistics</p> - +<li>Peter noticed that there were 5.2 million hits on CGSpace in 2019-10 according to the Atmire usage statistics <ul> -<li><p>I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 million in the API logs:</p> - +<li>I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 million in the API logs:</li> +</ul> +</li> +</ul> <pre><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot; 4671942 # zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot; 1277694 -</code></pre></li> -</ul></li> - -<li><p>So 4.6 million from XMLUI and another 1.2 million from API requests</p></li> - -<li><p>Let&rsquo;s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</p> - +</code></pre><ul> +<li>So 4.6 million from XMLUI and another 1.2 million from API requests</li> +<li>Let's see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</li> +</ul> <pre><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &quot;[0-9]{1,2}/Oct/2019&quot; 1183456 # zcat --force /var/log/nginx/rest.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep -c -E &quot;/rest/bitstreams&quot; 106781 -</code></pre></li> -</ul> +</code></pre> @@ -51,7 +47,6 @@ https://alanorth.github.io/cgspace-notes/cgspace-cgcorev2-migration/ <p>Possible changes to CGSpace metadata fields to align more with DC, QDC, and DCTERMS as well as CG Core v2.</p> - <p>With reference to <a href="https://agriculturalsemantics.github.io/cg-core/cgcore.html">CG Core v2 draft standard</a> by Marie-Angélique as well as <a href="http://www.dublincore.org/specifications/dublin-core/dcmi-terms/">DCMI DCTERMS</a>.</p> @@ -61,8 +56,7 @@ Tue, 01 Oct 2019 13:20:51 +0300 https://alanorth.github.io/cgspace-notes/2019-10/ - 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace - I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script&rsquo;s &ldquo;unneccesary Unicode&rdquo; fix: + 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script's &ldquo;unneccesary Unicode&rdquo; fix: $ csvcut -c 'id,dc. 
@@ -71,37 +65,34 @@ Sun, 01 Sep 2019 10:17:51 +0300 https://alanorth.github.io/cgspace-notes/2019-09/ - <h2 id="2019-09-01">2019-09-01</h2> - + <h2 id="20190901">2019-09-01</h2> <ul> <li>Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning</li> - -<li><p>Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:</p> - +<li>Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:</li> +</ul> <pre><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 -440 17.58.101.255 -441 157.55.39.101 -485 207.46.13.43 -728 169.60.128.125 -730 207.46.13.108 -758 157.55.39.9 -808 66.160.140.179 -814 207.46.13.212 -2472 163.172.71.23 -6092 3.94.211.189 + 440 17.58.101.255 + 441 157.55.39.101 + 485 207.46.13.43 + 728 169.60.128.125 + 730 207.46.13.108 + 758 157.55.39.9 + 808 66.160.140.179 + 814 207.46.13.212 + 2472 163.172.71.23 + 6092 3.94.211.189 # zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 - 33 2a01:7e00::f03c:91ff:fe16:fcb - 57 3.83.192.124 - 57 3.87.77.25 - 57 54.82.1.8 -822 2a01:9cc0:47:1:1a:4:0:2 -1223 45.5.184.72 -1633 172.104.229.92 -5112 205.186.128.185 -7249 2a01:7e00::f03c:91ff:fe18:7396 -9124 45.5.186.2 -</code></pre></li> -</ul> + 33 2a01:7e00::f03c:91ff:fe16:fcb + 57 3.83.192.124 + 57 3.87.77.25 + 57 54.82.1.8 + 822 2a01:9cc0:47:1:1a:4:0:2 + 1223 45.5.184.72 + 1633 172.104.229.92 + 5112 205.186.128.185 + 7249 2a01:7e00::f03c:91ff:fe18:7396 + 9124 45.5.186.2 +</code></pre> @@ -110,22 +101,19 @@ Sat, 03 Aug 2019 12:39:51 +0300 https://alanorth.github.io/cgspace-notes/2019-08/ - <h2 id="2019-08-03">2019-08-03</h2> - + <h2 id="20190803">2019-08-03</h2> <ul> -<li>Look at Bioversity&rsquo;s latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name&hellip;</li> +<li>Look at Bioversity's latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name&hellip;</li> </ul> - -<h2 id="2019-08-04">2019-08-04</h2> - +<h2 id="20190804">2019-08-04</h2> <ul> <li>Deploy ORCID identifier updates requested by Bioversity to CGSpace</li> <li>Run system updates on CGSpace (linode18) and reboot it - <ul> <li>Before updating it I checked Solr and verified that all statistics cores were loaded properly&hellip;</li> -<li>After rebooting, all statistics cores were loaded&hellip; wow, that&rsquo;s lucky.</li> -</ul></li> +<li>After rebooting, all statistics cores were loaded&hellip; wow, that's lucky.</li> +</ul> +</li> <li>Run system updates on DSpace Test (linode19) and reboot it</li> </ul> @@ -136,16 +124,15 @@ Mon, 01 Jul 2019 12:13:51 +0300 https://alanorth.github.io/cgspace-notes/2019-07/ - <h2 id="2019-07-01">2019-07-01</h2> - + <h2 id="20190701">2019-07-01</h2> <ul> <li>Create an &ldquo;AfricaRice books and book chapters&rdquo; collection on CGSpace for AfricaRice</li> <li>Last month Sisay asked why the following &ldquo;most popular&rdquo; statistics link for a range of months in 2018 works for the CIAT community on DSpace Test, but not on CGSpace: - <ul> <li><a 
href="https://dspacetest.cgiar.org/handle/10568/35697/most-popular/item#simplefilter=custom&amp;time_filter_end_date=01%2F12%2F2018">DSpace Test</a></li> <li><a href="https://cgspace.cgiar.org/handle/10568/35697/most-popular/item#simplefilter=custom&amp;time_filter_end_date=01%2F12%2F2018">CGSpace</a></li> -</ul></li> +</ul> +</li> <li>Abenet had another similar issue a few days ago when trying to find the stats for 2018 in the RTB community</li> </ul> @@ -156,15 +143,12 @@ Sun, 02 Jun 2019 10:57:51 +0300 https://alanorth.github.io/cgspace-notes/2019-06/ - <h2 id="2019-06-02">2019-06-02</h2> - + <h2 id="20190602">2019-06-02</h2> <ul> <li>Merge the <a href="https://github.com/ilri/DSpace/pull/425">Solr filterCache</a> and <a href="https://github.com/ilri/DSpace/pull/426">XMLUI ISI journal</a> changes to the <code>5_x-prod</code> branch and deploy on CGSpace</li> <li>Run system updates on CGSpace (linode18) and reboot it</li> </ul> - -<h2 id="2019-06-03">2019-06-03</h2> - +<h2 id="20190603">2019-06-03</h2> <ul> <li>Skype with Marie-Angélique and Abenet about <a href="https://agriculturalsemantics.github.io/cg-core/cgcore.html">CG Core v2</a></li> </ul> @@ -176,24 +160,21 @@ Wed, 01 May 2019 07:37:43 +0300 https://alanorth.github.io/cgspace-notes/2019-05/ - <h2 id="2019-05-01">2019-05-01</h2> - + <h2 id="20190501">2019-05-01</h2> <ul> <li>Help CCAFS with regenerating some item thumbnails after they uploaded new PDFs to some items on CGSpace</li> <li>A user on the dspace-tech mailing list offered some suggestions for troubleshooting the problem with the inability to delete certain items - <ul> <li>Apparently if the item is in the <code>workflowitem</code> table it is submitted to a workflow</li> <li>And if it is in the <code>workspaceitem</code> table it is in the pre-submitted state</li> -</ul></li> - -<li><p>The item seems to be in a pre-submitted state, so I tried to delete it from there:</p> - +</ul> +</li> +<li>The item seems to be in a pre-submitted state, so I tried to delete it from there:</li> +</ul> <pre><code>dspace=# DELETE FROM workspaceitem WHERE item_id=74648; DELETE 1 -</code></pre></li> - -<li><p>But after this I tried to delete the item from the XMLUI and it is <em>still</em> present&hellip;</p></li> +</code></pre><ul> +<li>But after this I tried to delete the item from the XMLUI and it is <em>still</em> present&hellip;</li> </ul> @@ -203,35 +184,30 @@ DELETE 1 Mon, 01 Apr 2019 09:00:43 +0300 https://alanorth.github.io/cgspace-notes/2019-04/ - <h2 id="2019-04-01">2019-04-01</h2> - + <h2 id="20190401">2019-04-01</h2> <ul> <li>Meeting with AgroKnow to discuss CGSpace, ILRI data, AReS, GARDIAN, etc - <ul> <li>They asked if we had plans to enable RDF support in CGSpace</li> -</ul></li> - -<li><p>There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today</p> - +</ul> +</li> +<li>There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today <ul> -<li><p>I suspected that some might not be successful, because the stats show less, but today they were all HTTP 200!</p> - +<li>I suspected that some might not be successful, because the stats show less, but today they were all HTTP 200!</li> +</ul> +</li> +</ul> <pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5 -4432 200 -</code></pre></li> -</ul></li> - -<li><p>In the last two 
weeks there have been 47,000 downloads of this <em>same exact PDF</em> by these three IP addresses</p></li> - -<li><p>Apply country and region corrections and deletions on DSpace Test and CGSpace:</p> - + 4432 200 +</code></pre><ul> +<li>In the last two weeks there have been 47,000 downloads of this <em>same exact PDF</em> by these three IP addresses</li> +<li>Apply country and region corrections and deletions on DSpace Test and CGSpace:</li> +</ul> <pre><code>$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d -</code></pre></li> -</ul> +</code></pre> @@ -240,20 +216,19 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace Fri, 01 Mar 2019 12:16:30 +0100 https://alanorth.github.io/cgspace-notes/2019-03/ - <h2 id="2019-03-01">2019-03-01</h2> - + <h2 id="20190301">2019-03-01</h2> <ul> -<li>I checked IITA&rsquo;s 259 Feb 14 records from last month for duplicates using Atmire&rsquo;s Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good</li> +<li>I checked IITA's 259 Feb 14 records from last month for duplicates using Atmire's Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good</li> <li>I am now only waiting to hear from her about where the items should go, though I assume Journal Articles go to IITA Journal Articles collection, etc&hellip;</li> -<li>Looking at the other half of Udana&rsquo;s WLE records from 2018-11 - +<li>Looking at the other half of Udana's WLE records from 2018-11 <ul> <li>I finished the ones for Restoring Degraded Landscapes (RDL), but these are for Variability, Risks and Competing Uses (VRC)</li> <li>I did the usual cleanups for whitespace, added regions where they made sense for certain countries, cleaned up the DOI link formats, added rights information based on the publications page for a few items</li> <li>Most worryingly, there are encoding errors in the abstracts for eleven items, for example:</li> <li>68.15% � 9.45 instead of 68.15% ± 9.45</li> <li>2003�2013 instead of 2003–2013</li> -</ul></li> +</ul> +</li> <li>I think I will need to ask Udana to re-copy and paste the abstracts with more care using Google Docs</li> </ul> @@ -264,40 +239,34 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace Fri, 01 Feb 2019 21:37:30 +0200 https://alanorth.github.io/cgspace-notes/2019-02/ - <h2 id="2019-02-01">2019-02-01</h2> - + <h2 id="20190201">2019-02-01</h2> <ul> <li>Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!</li> - -<li><p>The top IPs before, during, and after this latest alert tonight were:</p> - +<li>The top IPs before, during, and after this latest alert tonight were:</li> +</ul> <pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;01/Feb/2019:(17|18|19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 -245 207.46.13.5 -332 
54.70.40.11 -385 5.143.231.38 -405 207.46.13.173 -405 207.46.13.75 -1117 66.249.66.219 -1121 35.237.175.180 -1546 5.9.6.51 -2474 45.5.186.2 -5490 85.25.237.71 -</code></pre></li> - -<li><p><code>85.25.237.71</code> is the &ldquo;Linguee Bot&rdquo; that I first saw last month</p></li> - -<li><p>The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase</p></li> - -<li><p>There were just over 3 million accesses in the nginx logs last month:</p> - + 245 207.46.13.5 + 332 54.70.40.11 + 385 5.143.231.38 + 405 207.46.13.173 + 405 207.46.13.75 + 1117 66.249.66.219 + 1121 35.237.175.180 + 1546 5.9.6.51 + 2474 45.5.186.2 + 5490 85.25.237.71 +</code></pre><ul> +<li><code>85.25.237.71</code> is the &ldquo;Linguee Bot&rdquo; that I first saw last month</li> +<li>The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase</li> +<li>There were just over 3 million accesses in the nginx logs last month:</li> +</ul> <pre><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2019&quot; 3018243 real 0m19.873s user 0m22.203s sys 0m1.979s -</code></pre></li> -</ul> +</code></pre> @@ -306,26 +275,23 @@ sys 0m1.979s Wed, 02 Jan 2019 09:48:30 +0200 https://alanorth.github.io/cgspace-notes/2019-01/ - <h2 id="2019-01-02">2019-01-02</h2> - + <h2 id="20190102">2019-01-02</h2> <ul> <li>Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning</li> - -<li><p>I don&rsquo;t see anything interesting in the web server logs around that time though:</p> - +<li>I don't see anything interesting in the web server logs around that time though:</li> +</ul> <pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 - 92 40.77.167.4 - 99 210.7.29.100 -120 38.126.157.45 -177 35.237.175.180 -177 40.77.167.32 -216 66.249.75.219 -225 18.203.76.93 -261 46.101.86.248 -357 207.46.13.1 -903 54.70.40.11 -</code></pre></li> -</ul> + 92 40.77.167.4 + 99 210.7.29.100 + 120 38.126.157.45 + 177 35.237.175.180 + 177 40.77.167.32 + 216 66.249.75.219 + 225 18.203.76.93 + 261 46.101.86.248 + 357 207.46.13.1 + 903 54.70.40.11 +</code></pre> @@ -334,16 +300,13 @@ sys 0m1.979s Sun, 02 Dec 2018 02:09:30 +0200 https://alanorth.github.io/cgspace-notes/2018-12/ - <h2 id="2018-12-01">2018-12-01</h2> - + <h2 id="20181201">2018-12-01</h2> <ul> <li>Switch CGSpace (linode18) to use OpenJDK instead of Oracle JDK</li> <li>I manually installed OpenJDK, then removed Oracle JDK, then re-ran the <a href="http://github.com/ilri/rmg-ansible-public">Ansible playbook</a> to update all configuration files, etc</li> <li>Then I ran all system updates and restarted the server</li> </ul> - -<h2 id="2018-12-02">2018-12-02</h2> - +<h2 id="20181202">2018-12-02</h2> <ul> <li>I noticed that there is another issue with PDF thumbnails on CGSpace, and I see there was another <a href="https://usn.ubuntu.com/3831-1/">Ghostscript vulnerability last week</a></li> </ul> @@ -355,15 +318,12 @@ sys 0m1.979s Thu, 01 Nov 2018 16:41:30 +0200 https://alanorth.github.io/cgspace-notes/2018-11/ - <h2 id="2018-11-01">2018-11-01</h2> - + <h2 id="20181101">2018-11-01</h2> <ul> <li>Finalize AReS Phase I and Phase II ToRs</li> <li>Send a note about my <a href="https://github.com/ilri/dspace-statistics-api">dspace-statistics-api</a> to the dspace-tech mailing list</li> </ul> - -<h2 
id="2018-11-03">2018-11-03</h2> - +<h2 id="20181103">2018-11-03</h2> <ul> <li>Linode has been sending mails a few times a day recently that CGSpace (linode18) has had high CPU usage</li> <li>Today these are the top 10 IPs:</li> @@ -376,11 +336,10 @@ sys 0m1.979s Mon, 01 Oct 2018 22:31:54 +0300 https://alanorth.github.io/cgspace-notes/2018-10/ - <h2 id="2018-10-01">2018-10-01</h2> - + <h2 id="20181001">2018-10-01</h2> <ul> <li>Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items</li> -<li>I created a GitHub issue to track this <a href="https://github.com/ilri/DSpace/issues/389">#389</a>, because I&rsquo;m super busy in Nairobi right now</li> +<li>I created a GitHub issue to track this <a href="https://github.com/ilri/DSpace/issues/389">#389</a>, because I'm super busy in Nairobi right now</li> </ul> @@ -390,13 +349,12 @@ sys 0m1.979s Sun, 02 Sep 2018 09:55:54 +0300 https://alanorth.github.io/cgspace-notes/2018-09/ - <h2 id="2018-09-02">2018-09-02</h2> - + <h2 id="20180902">2018-09-02</h2> <ul> <li>New <a href="https://jdbc.postgresql.org/documentation/changelog.html#version_42.2.5">PostgreSQL JDBC driver version 42.2.5</a></li> -<li>I&rsquo;ll update the DSpace role in our <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure playbooks</a> and run the updated playbooks on CGSpace and DSpace Test</li> -<li>Also, I&rsquo;ll re-run the <code>postgresql</code> tasks because the custom PostgreSQL variables are dynamic according to the system&rsquo;s RAM, and we never re-ran them after migrating to larger Linodes last month</li> -<li>I&rsquo;m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I&rsquo;m getting those autowire errors in Tomcat 8.5.30 again:</li> +<li>I'll update the DSpace role in our <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure playbooks</a> and run the updated playbooks on CGSpace and DSpace Test</li> +<li>Also, I'll re-run the <code>postgresql</code> tasks because the custom PostgreSQL variables are dynamic according to the system's RAM, and we never re-ran them after migrating to larger Linodes last month</li> +<li>I'm testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I'm getting those autowire errors in Tomcat 8.5.30 again:</li> </ul> @@ -406,27 +364,20 @@ sys 0m1.979s Wed, 01 Aug 2018 11:52:54 +0300 https://alanorth.github.io/cgspace-notes/2018-08/ - <h2 id="2018-08-01">2018-08-01</h2> - + <h2 id="20180801">2018-08-01</h2> <ul> -<li><p>DSpace Test had crashed at some point yesterday morning and I see the following in <code>dmesg</code>:</p> - +<li>DSpace Test had crashed at some point yesterday morning and I see the following in <code>dmesg</code>:</li> +</ul> <pre><code>[Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child [Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB [Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB -</code></pre></li> - -<li><p>Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight</p></li> - -<li><p>From the DSpace log I see that eventually Solr stopped responding, so I guess the <code>java</code> process that was OOM killed above was Tomcat&rsquo;s</p></li> - -<li><p>I&rsquo;m not sure why Tomcat didn&rsquo;t crash with an OutOfMemoryError&hellip;</p></li> - -<li><p>Anyways, 
perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core</p></li> - -<li><p>The server only has 8GB of RAM so we&rsquo;ll eventually need to upgrade to a larger one because we&rsquo;ll start starving the OS, PostgreSQL, and command line batch processes</p></li> - -<li><p>I ran all system updates on DSpace Test and rebooted it</p></li> +</code></pre><ul> +<li>Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight</li> +<li>From the DSpace log I see that eventually Solr stopped responding, so I guess the <code>java</code> process that was OOM killed above was Tomcat's</li> +<li>I'm not sure why Tomcat didn't crash with an OutOfMemoryError&hellip;</li> +<li>Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core</li> +<li>The server only has 8GB of RAM so we'll eventually need to upgrade to a larger one because we'll start starving the OS, PostgreSQL, and command line batch processes</li> +<li>I ran all system updates on DSpace Test and rebooted it</li> </ul> @@ -436,19 +387,16 @@ sys 0m1.979s Sun, 01 Jul 2018 12:56:54 +0300 https://alanorth.github.io/cgspace-notes/2018-07/ - <h2 id="2018-07-01">2018-07-01</h2> - + <h2 id="20180701">2018-07-01</h2> <ul> -<li><p>I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:</p> - +<li>I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:</li> +</ul> <pre><code>$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace -</code></pre></li> - -<li><p>During the <code>mvn package</code> stage on the 5.8 branch I kept getting issues with java running out of memory:</p> - +</code></pre><ul> +<li>During the <code>mvn package</code> stage on the 5.8 branch I kept getting issues with java running out of memory:</li> +</ul> <pre><code>There is insufficient memory for the Java Runtime Environment to continue. 
-</code></pre></li> -</ul> +</code></pre> @@ -457,32 +405,27 @@ sys 0m1.979s Mon, 04 Jun 2018 19:49:54 -0700 https://alanorth.github.io/cgspace-notes/2018-06/ - <h2 id="2018-06-04">2018-06-04</h2> - + <h2 id="20180604">2018-06-04</h2> <ul> <li>Test the <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560">DSpace 5.8 module upgrades from Atmire</a> (<a href="https://github.com/ilri/DSpace/pull/378">#378</a>) - <ul> -<li>There seems to be a problem with the CUA and L&amp;R versions in <code>pom.xml</code> because they are using SNAPSHOT and it doesn&rsquo;t build</li> -</ul></li> +<li>There seems to be a problem with the CUA and L&amp;R versions in <code>pom.xml</code> because they are using SNAPSHOT and it doesn't build</li> +</ul> +</li> <li>I added the new CCAFS Phase II Project Tag <code>PII-FP1_PACCA2</code> and merged it into the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/379">#379</a>)</li> - -<li><p>I proofed and tested the ILRI author corrections that Peter sent back to me this week:</p> - +<li>I proofed and tested the ILRI author corrections that Peter sent back to me this week:</li> +</ul> <pre><code>$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n -</code></pre></li> - -<li><p>I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in <a href="https://alanorth.github.io/cgspace-notes/cgspace-notes/2018-03/">March, 2018</a></p></li> - -<li><p>Time to index ~70,000 items on CGSpace:</p> - +</code></pre><ul> +<li>I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in <a href="https://alanorth.github.io/cgspace-notes/cgspace-notes/2018-03/">March, 2018</a></li> +<li>Time to index ~70,000 items on CGSpace:</li> +</ul> <pre><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b real 74m42.646s user 8m5.056s sys 2m7.289s -</code></pre></li> -</ul> +</code></pre> @@ -491,15 +434,14 @@ sys 2m7.289s Tue, 01 May 2018 16:43:54 +0300 https://alanorth.github.io/cgspace-notes/2018-05/ - <h2 id="2018-05-01">2018-05-01</h2> - + <h2 id="20180501">2018-05-01</h2> <ul> <li>I cleared the Solr statistics core on DSpace Test by issuing two commands directly to the Solr admin interface: - <ul> -<li><a href="http://localhost:3000/solr/statistics/update?stream.body=%3Cdelete%3E%3Cquery%3E*:*%3C/query%3E%3C/delete%3E">http://localhost:3000/solr/statistics/update?stream.body=%3Cdelete%3E%3Cquery%3E*:*%3C/query%3E%3C/delete%3E</a></li> -<li><a href="http://localhost:3000/solr/statistics/update?stream.body=%3Ccommit/%3E">http://localhost:3000/solr/statistics/update?stream.body=%3Ccommit/%3E</a></li> -</ul></li> +<li>http://localhost:3000/solr/statistics/update?stream.body=%3Cdelete%3E%3Cquery%3E*:*%3C/query%3E%3C/delete%3E</li> +<li>http://localhost:3000/solr/statistics/update?stream.body=%3Ccommit/%3E</li> +</ul> +</li> <li>Then I reduced the JVM heap size from 6144 back to 5120m</li> <li>Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked the <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure scripts</a> to support hosts choosing which distribution they want to use</li> </ul> @@ -511,10 +453,9 @@ sys 2m7.289s Sun, 01 Apr 2018 16:13:54 +0200 https://alanorth.github.io/cgspace-notes/2018-04/ - 
<h2 id="2018-04-01">2018-04-01</h2> - + <h2 id="20180401">2018-04-01</h2> <ul> -<li>I tried to test something on DSpace Test but noticed that it&rsquo;s down since god knows when</li> +<li>I tried to test something on DSpace Test but noticed that it's down since god knows when</li> <li>Catalina logs at least show some memory errors yesterday:</li> </ul> @@ -525,8 +466,7 @@ sys 2m7.289s Fri, 02 Mar 2018 16:07:54 +0200 https://alanorth.github.io/cgspace-notes/2018-03/ - <h2 id="2018-03-02">2018-03-02</h2> - + <h2 id="20180302">2018-03-02</h2> <ul> <li>Export a CSV of the IITA community metadata for Martin Mueller</li> </ul> @@ -538,13 +478,12 @@ sys 2m7.289s Thu, 01 Feb 2018 16:28:54 +0200 https://alanorth.github.io/cgspace-notes/2018-02/ - <h2 id="2018-02-01">2018-02-01</h2> - + <h2 id="20180201">2018-02-01</h2> <ul> <li>Peter gave feedback on the <code>dc.rights</code> proof of concept that I had sent him last week</li> -<li>We don&rsquo;t need to distinguish between internal and external works, so that makes it just a simple list</li> +<li>We don't need to distinguish between internal and external works, so that makes it just a simple list</li> <li>Yesterday I figured out how to monitor DSpace sessions using JMX</li> -<li>I copied the logic in the <code>jmx_tomcat_dbpools</code> provided by Ubuntu&rsquo;s <code>munin-plugins-java</code> package and used the stuff I discovered about JMX <a href="https://alanorth.github.io/cgspace-notes/cgspace-notes/2018-01/">in 2018-01</a></li> +<li>I copied the logic in the <code>jmx_tomcat_dbpools</code> provided by Ubuntu's <code>munin-plugins-java</code> package and used the stuff I discovered about JMX <a href="https://alanorth.github.io/cgspace-notes/cgspace-notes/2018-01/">in 2018-01</a></li> </ul> @@ -554,33 +493,26 @@ sys 2m7.289s Tue, 02 Jan 2018 08:35:54 -0800 https://alanorth.github.io/cgspace-notes/2018-01/ - <h2 id="2018-01-02">2018-01-02</h2> - + <h2 id="20180102">2018-01-02</h2> <ul> <li>Uptime Robot noticed that CGSpace went down and up a few times last night, for a few minutes each time</li> -<li>I didn&rsquo;t get any load alerts from Linode and the REST and XMLUI logs don&rsquo;t show anything out of the ordinary</li> +<li>I didn't get any load alerts from Linode and the REST and XMLUI logs don't show anything out of the ordinary</li> <li>The nginx logs show HTTP 200s until <code>02/Jan/2018:11:27:17 +0000</code> when Uptime Robot got an HTTP 500</li> <li>In dspace.log around that time I see many errors like &ldquo;Client closed the connection before file download was complete&rdquo;</li> - -<li><p>And just before that I see this:</p> - +<li>And just before that I see this:</li> +</ul> <pre><code>Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000]. -</code></pre></li> - -<li><p>Ah hah! So the pool was actually empty!</p></li> - -<li><p>I need to increase that, let&rsquo;s try to bump it up from 50 to 75</p></li> - -<li><p>After that one client got an HTTP 499 but then the rest were HTTP 200, so I don&rsquo;t know what the hell Uptime Robot saw</p></li> - -<li><p>I notice this error quite a few times in dspace.log:</p> - +</code></pre><ul> +<li>Ah hah! 
So the pool was actually empty!</li> +<li>I need to increase that, let's try to bump it up from 50 to 75</li> +<li>After that one client got an HTTP 499 but then the rest were HTTP 200, so I don't know what the hell Uptime Robot saw</li> +<li>I notice this error quite a few times in dspace.log:</li> +</ul> <pre><code>2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered &quot; &quot;]&quot; &quot;] &quot;&quot; at line 1, column 32. -</code></pre></li> - -<li><p>And there are many of these errors every day for the past month:</p> - +</code></pre><ul> +<li>And there are many of these errors every day for the past month:</li> +</ul> <pre><code>$ grep -c &quot;Error while searching for sidebar facets&quot; dspace.log.* dspace.log.2017-11-21:4 dspace.log.2017-11-22:1 @@ -625,9 +557,8 @@ dspace.log.2017-12-30:89 dspace.log.2017-12-31:53 dspace.log.2018-01-01:45 dspace.log.2018-01-02:34 -</code></pre></li> - -<li><p>Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let&rsquo;s Encrypt if it&rsquo;s just a handful of domains</p></li> +</code></pre><ul> +<li>Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let's Encrypt if it's just a handful of domains</li> </ul> @@ -637,8 +568,7 @@ dspace.log.2018-01-02:34 Fri, 01 Dec 2017 13:53:54 +0300 https://alanorth.github.io/cgspace-notes/2017-12/ - <h2 id="2017-12-01">2017-12-01</h2> - + <h2 id="20171201">2017-12-01</h2> <ul> <li>Uptime Robot noticed that CGSpace went down</li> <li>The logs say &ldquo;Timeout waiting for idle object&rdquo;</li> @@ -653,27 +583,22 @@ dspace.log.2018-01-02:34 Thu, 02 Nov 2017 09:37:54 +0200 https://alanorth.github.io/cgspace-notes/2017-11/ - <h2 id="2017-11-01">2017-11-01</h2> - + <h2 id="20171101">2017-11-01</h2> <ul> <li>The CORE developers responded to say they are looking into their bot not respecting our robots.txt</li> </ul> - -<h2 id="2017-11-02">2017-11-02</h2> - +<h2 id="20171102">2017-11-02</h2> <ul> -<li><p>Today there have been no hits by CORE and no alerts from Linode (coincidence?)</p> - +<li>Today there have been no hits by CORE and no alerts from Linode (coincidence?)</li> +</ul> <pre><code># grep -c &quot;CORE&quot; /var/log/nginx/access.log 0 -</code></pre></li> - -<li><p>Generate list of authors on CGSpace for Peter to go through and correct:</p> - +</code></pre><ul> +<li>Generate list of authors on CGSpace for Peter to go through and correct:</li> +</ul> <pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv; COPY 54701 -</code></pre></li> -</ul> +</code></pre> @@ -682,17 +607,14 @@ COPY 54701 Sun, 01 Oct 2017 08:07:54 +0300 https://alanorth.github.io/cgspace-notes/2017-10/ - <h2 id="2017-10-01">2017-10-01</h2> - + <h2 id="20171001">2017-10-01</h2> <ul> -<li><p>Peter emailed to point out that many items in the <a href="https://cgspace.cgiar.org/handle/10568/2703">ILRI archive collection</a> have multiple handles:</p> - +<li>Peter emailed to point out that many items in the <a 
href="https://cgspace.cgiar.org/handle/10568/2703">ILRI archive collection</a> have multiple handles:</li> +</ul> <pre><code>http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336 -</code></pre></li> - -<li><p>There appears to be a pattern but I&rsquo;ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine</p></li> - -<li><p>Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections</p></li> +</code></pre><ul> +<li>There appears to be a pattern but I'll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine</li> +<li>Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections</li> </ul> @@ -711,16 +633,13 @@ COPY 54701 Thu, 07 Sep 2017 16:54:52 +0700 https://alanorth.github.io/cgspace-notes/2017-09/ - <h2 id="2017-09-06">2017-09-06</h2> - + <h2 id="20170906">2017-09-06</h2> <ul> <li>Linode sent an alert that CGSpace (linode18) was using 261% CPU for the past two hours</li> </ul> - -<h2 id="2017-09-07">2017-09-07</h2> - +<h2 id="20170907">2017-09-07</h2> <ul> -<li>Ask Sisay to clean up the WLE approvers a bit, as Marianne&rsquo;s user account is both in the approvers step as well as the group</li> +<li>Ask Sisay to clean up the WLE approvers a bit, as Marianne's user account is both in the approvers step as well as the group</li> </ul> @@ -730,22 +649,21 @@ COPY 54701 Tue, 01 Aug 2017 11:51:52 +0300 https://alanorth.github.io/cgspace-notes/2017-08/ - <h2 id="2017-08-01">2017-08-01</h2> - + <h2 id="20170801">2017-08-01</h2> <ul> <li>Linode sent an alert that CGSpace (linode18) was using 350% CPU for the past two hours</li> <li>I looked in the Activity pane of the Admin Control Panel and it seems that Google, Baidu, Yahoo, and Bing are all crawling with massive numbers of bots concurrently (~100 total, mostly Baidu and Google)</li> <li>The good thing is that, according to <code>dspace.log.2017-08-01</code>, they are all using the same Tomcat session</li> <li>This means our Tomcat Crawler Session Valve is working</li> <li>But many of the bots are browsing dynamic URLs like: - <ul> <li>/handle/10568/3353/discover</li> <li>/handle/10568/16510/browse</li> -</ul></li> +</ul> +</li> <li>The <code>robots.txt</code> only blocks the top-level <code>/discover</code> and <code>/browse</code> URLs&hellip; we will need to find a way to forbid them from accessing these!</li> <li>Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li> -<li>It turns out that we&rsquo;re already adding the <code>X-Robots-Tag &quot;none&quot;</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li> +<li>It turns out that we're already adding the <code>X-Robots-Tag &quot;none&quot;</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li> <li>Also, the bot has to successfully browse the page first so it can receive the HTTP header&hellip;</li> <li>We might actually have to <em>block</em> these requests with HTTP 403 depending on the user agent</li> <li>Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415</li> @@ -761,18 +679,15 @@ COPY 54701 Sat, 01 Jul 2017 18:03:52 +0300 https://alanorth.github.io/cgspace-notes/2017-07/ - <h2 
id="2017-07-01">2017-07-01</h2> - + <h2 id="20170701">2017-07-01</h2> <ul> <li>Run system updates and reboot DSpace Test</li> </ul> - -<h2 id="2017-07-04">2017-07-04</h2> - +<h2 id="20170704">2017-07-04</h2> <ul> <li>Merge changes for WLE Phase II theme rename (<a href="https://github.com/ilri/DSpace/pull/329">#329</a>)</li> -<li>Looking at extracting the metadata registries from ICARDA&rsquo;s MEL DSpace database so we can compare fields with CGSpace</li> -<li>We can use PostgreSQL&rsquo;s extended output format (<code>-x</code>) plus <code>sed</code> to format the output into quasi XML:</li> +<li>Looking at extracting the metadata registries from ICARDA's MEL DSpace database so we can compare fields with CGSpace</li> +<li>We can use PostgreSQL's extended output format (<code>-x</code>) plus <code>sed</code> to format the output into quasi XML:</li> </ul> @@ -782,7 +697,7 @@ COPY 54701 Thu, 01 Jun 2017 10:14:52 +0300 https://alanorth.github.io/cgspace-notes/2017-06/ - 2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we&rsquo;ll create a new sub-community for Phase II and create collections for the research themes there The current &ldquo;Research Themes&rdquo; community will be renamed to &ldquo;WLE Phase I Research Themes&rdquo; Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg. + 2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we'll create a new sub-community for Phase II and create collections for the research themes there The current &ldquo;Research Themes&rdquo; community will be renamed to &ldquo;WLE Phase I Research Themes&rdquo; Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg. @@ -791,7 +706,7 @@ COPY 54701 Mon, 01 May 2017 16:21:52 +0200 https://alanorth.github.io/cgspace-notes/2017-05/ - 2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it&rsquo;s a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire&rsquo;s CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace. 
+ 2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it's a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire's CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace. @@ -800,23 +715,18 @@ COPY 54701 Sun, 02 Apr 2017 17:08:52 +0200 https://alanorth.github.io/cgspace-notes/2017-04/ - <h2 id="2017-04-02">2017-04-02</h2> - + <h2 id="20170402">2017-04-02</h2> <ul> <li>Merge one change to CCAFS flagships that I had forgotten to remove last month (&ldquo;MANAGING CLIMATE RISK&rdquo;): <a href="https://github.com/ilri/DSpace/pull/317">https://github.com/ilri/DSpace/pull/317</a></li> <li>Quick proof-of-concept hack to add <code>dc.rights</code> to the input form, including some inline instructions/hints:</li> </ul> - -<p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2017/04/dc-rights.png" alt="dc.rights in the submission form" /></p> - +<p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2017/04/dc-rights.png" alt="dc.rights in the submission form"></p> <ul> <li>Remove redundant/duplicate text in the DSpace submission license</li> - -<li><p>Testing the CMYK patch on a collection with 650 items:</p> - +<li>Testing the CMYK patch on a collection with 650 items:</li> +</ul> <pre><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &quot;ImageMagick PDF Thumbnail&quot; -v &gt;&amp; /tmp/filter-media-cmyk.txt -</code></pre></li> -</ul> +</code></pre> @@ -825,14 +735,11 @@ COPY 54701 Wed, 01 Mar 2017 17:08:52 +0200 https://alanorth.github.io/cgspace-notes/2017-03/ - <h2 id="2017-03-01">2017-03-01</h2> - + <h2 id="20170301">2017-03-01</h2> <ul> <li>Run the 279 CIAT author corrections on CGSpace</li> </ul> - -<h2 id="2017-03-02">2017-03-02</h2> - +<h2 id="20170302">2017-03-02</h2> <ul> <li>Skype with Michael and Peter, discussing moving the CGIAR Library to CGSpace</li> <li>CGIAR people possibly open to moving content, redirecting library.cgiar.org to CGSpace and letting CGSpace resolve their handles</li> @@ -842,13 +749,11 @@ COPY 54701 <li>Also, need to consider talking to Atmire about hiring them to bring ORCiD metadata to REST / OAI</li> <li>Filed an issue on DSpace issue tracker for the <code>filter-media</code> bug that causes it to process JPGs even when limiting to the PDF thumbnail plugin: <a href="https://jira.duraspace.org/browse/DS-3516">DS-3516</a></li> <li>Discovered that the ImageMagic <code>filter-media</code> plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK</li> - -<li><p>Interestingly, it seems DSpace 4.x&rsquo;s thumbnails were sRGB, but forcing regeneration using DSpace 5.x&rsquo;s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see <a 
href="https://cgspace.cgiar.org/handle/10568/51999"><sup>10568</sup>&frasl;<sub>51999</sub></a>):</p> - +<li>Interestingly, it seems DSpace 4.x's thumbnails were sRGB, but forcing regeneration using DSpace 5.x's ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see <a href="https://cgspace.cgiar.org/handle/10568/51999">10568/51999</a>):</li> +</ul> <pre><code>$ identify ~/Desktop/alc_contrastes_desafios.jpg /Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000 -</code></pre></li> -</ul> +</code></pre> @@ -857,25 +762,22 @@ COPY 54701 Tue, 07 Feb 2017 07:04:52 -0800 https://alanorth.github.io/cgspace-notes/2017-02/ - <h2 id="2017-02-07">2017-02-07</h2> - + <h2 id="20170207">2017-02-07</h2> <ul> -<li><p>An item was mapped twice erroneously again, so I had to remove one of the mappings manually:</p> - +<li>An item was mapped twice erroneously again, so I had to remove one of the mappings manually:</li> +</ul> <pre><code>dspace=# select * from collection2item where item_id = '80278'; -id | collection_id | item_id + id | collection_id | item_id -------+---------------+--------- -92551 | 313 | 80278 -92550 | 313 | 80278 -90774 | 1051 | 80278 + 92551 | 313 | 80278 + 92550 | 313 | 80278 + 90774 | 1051 | 80278 (3 rows) dspace=# delete from collection2item where id = 92551 and item_id = 80278; DELETE 1 -</code></pre></li> - -<li><p>Create issue on GitHub to track the addition of CCAFS Phase II project tags (<a href="https://github.com/ilri/DSpace/issues/301">#301</a>)</p></li> - -<li><p>Looks like we&rsquo;ll be using <code>cg.identifier.ccafsprojectpii</code> as the field name</p></li> +</code></pre><ul> +<li>Create issue on GitHub to track the addition of CCAFS Phase II project tags (<a href="https://github.com/ilri/DSpace/issues/301">#301</a>)</li> +<li>Looks like we'll be using <code>cg.identifier.ccafsprojectpii</code> as the field name</li> </ul> @@ -885,12 +787,11 @@ DELETE 1 Mon, 02 Jan 2017 10:43:00 +0300 https://alanorth.github.io/cgspace-notes/2017-01/ - <h2 id="2017-01-02">2017-01-02</h2> - + <h2 id="20170102">2017-01-02</h2> <ul> <li>I checked to see if the Solr sharding task that is supposed to run on January 1st had run and saw there was an error</li> -<li>I tested on DSpace Test as well and it doesn&rsquo;t work there either</li> -<li>I asked on the dspace-tech mailing list because it seems to be broken, and actually now I&rsquo;m not sure if we&rsquo;ve ever had the sharding task run successfully over all these years</li> +<li>I tested on DSpace Test as well and it doesn't work there either</li> +<li>I asked on the dspace-tech mailing list because it seems to be broken, and actually now I'm not sure if we've ever had the sharding task run successfully over all these years</li> </ul> @@ -900,25 +801,20 @@ DELETE 1 Fri, 02 Dec 2016 10:43:00 +0300 https://alanorth.github.io/cgspace-notes/2016-12/ - <h2 id="2016-12-02">2016-12-02</h2> - + <h2 id="20161202">2016-12-02</h2> <ul> <li>CGSpace was down for five hours in the morning while I was sleeping</li> - -<li><p>While looking in the logs for errors, I see tons of warnings about Atmire MQM:</p> - +<li>While looking in the logs for errors, I see tons of warnings about Atmire MQM:</li> +</ul> <pre><code>2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), 
ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;) 2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail=&quot;dc.title&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;) 2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail=&quot;THUMBNAIL&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;) 2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail=&quot;-1&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;) 2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;) -</code></pre></li> - -<li><p>I see thousands of them in the logs for the last few months, so it&rsquo;s not related to the DSpace 5.5 upgrade</p></li> - -<li><p>I&rsquo;ve raised a ticket with Atmire to ask</p></li> - -<li><p>Another worrying error from dspace.log is:</p></li> +</code></pre><ul> +<li>I see thousands of them in the logs for the last few months, so it's not related to the DSpace 5.5 upgrade</li> +<li>I've raised a ticket with Atmire to ask</li> +<li>Another worrying error from dspace.log is:</li> </ul> @@ -928,13 +824,11 @@ DELETE 1 Tue, 01 Nov 2016 09:21:00 +0300 https://alanorth.github.io/cgspace-notes/2016-11/ - <h2 id="2016-11-01">2016-11-01</h2> - + <h2 id="20161101">2016-11-01</h2> <ul> -<li>Add <code>dc.type</code> to the output options for Atmire&rsquo;s Listings and Reports module (<a href="https://github.com/ilri/DSpace/pull/286">#286</a>)</li> +<li>Add <code>dc.type</code> to the output options for Atmire's Listings and Reports module (<a href="https://github.com/ilri/DSpace/pull/286">#286</a>)</li> </ul> - -<p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/11/listings-and-reports.png" alt="Listings and Reports with output type" /></p> +<p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/11/listings-and-reports.png" alt="Listings and Reports with output type"></p> @@ -943,22 +837,19 @@ DELETE 1 Mon, 03 Oct 2016 15:53:00 +0300 https://alanorth.github.io/cgspace-notes/2016-10/ - <h2 id="2016-10-03">2016-10-03</h2> - + <h2 id="20161003">2016-10-03</h2> <ul> <li>Testing adding <a 
href="https://wiki.duraspace.org/display/DSDOC5x/ORCID+Integration#ORCIDIntegration-EditingexistingitemsusingBatchCSVEditing">ORCIDs to a CSV</a> file for a single item to see if the author orders get messed up</li> <li>Need to test the following scenarios to see how author order is affected: - <ul> <li>ORCIDs only</li> <li>ORCIDs plus normal authors</li> -</ul></li> - -<li><p>I exported a random item&rsquo;s metadata as CSV, deleted <em>all columns</em> except id and collection, and made a new coloum called <code>ORCID:dc.contributor.author</code> with the following random ORCIDs from the ORCID registry:</p> - +</ul> +</li> +<li>I exported a random item's metadata as CSV, deleted <em>all columns</em> except id and collection, and made a new coloum called <code>ORCID:dc.contributor.author</code> with the following random ORCIDs from the ORCID registry:</li> +</ul> <pre><code>0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X -</code></pre></li> -</ul> +</code></pre> @@ -967,18 +858,15 @@ DELETE 1 Thu, 01 Sep 2016 15:53:00 +0300 https://alanorth.github.io/cgspace-notes/2016-09/ - <h2 id="2016-09-01">2016-09-01</h2> - + <h2 id="20160901">2016-09-01</h2> <ul> <li>Discuss helping CCAFS with some batch tagging of ORCID IDs for their authors</li> -<li>Discuss how the migration of CGIAR&rsquo;s Active Directory to a flat structure will break our LDAP groups in DSpace</li> +<li>Discuss how the migration of CGIAR's Active Directory to a flat structure will break our LDAP groups in DSpace</li> <li>We had been using <code>DC=ILRI</code> to determine whether a user was ILRI or not</li> - -<li><p>It looks like we might be able to use OUs now, instead of DCs:</p> - +<li>It looks like we might be able to use OUs now, instead of DCs:</li> +</ul> <pre><code>$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &quot;dc=cgiarad,dc=org&quot; -D &quot;admigration1@cgiarad.org&quot; -W &quot;(sAMAccountName=admigration1)&quot; -</code></pre></li> -</ul> +</code></pre> @@ -987,22 +875,19 @@ DELETE 1 Mon, 01 Aug 2016 15:53:00 +0300 https://alanorth.github.io/cgspace-notes/2016-08/ - <h2 id="2016-08-01">2016-08-01</h2> - + <h2 id="20160801">2016-08-01</h2> <ul> <li>Add updated distribution license from Sisay (<a href="https://github.com/ilri/DSpace/issues/259">#259</a>)</li> <li>Play with upgrading Mirage 2 dependencies in <code>bower.json</code> because most are several versions of out date</li> <li>Bootstrap is at 3.3.0 but upstream is at 3.3.7, and upgrading to anything beyond 3.3.1 breaks glyphicons and probably more</li> <li>bower stuff is a dead end, waste of time, too many issues</li> <li>Anything after Bootstrap 3.3.1 makes glyphicons disappear (HTTP 404 trying to access from incorrect path of <code>fonts</code>)</li> - -<li><p>Start working on DSpace 5.1 → 5.5 port:</p> - +<li>Start working on DSpace 5.1 → 5.5 port:</li> +</ul> <pre><code>$ git checkout -b 55new 5_x-prod $ git reset --hard ilri/5_x-prod $ git rebase -i dspace-5.5 -</code></pre></li> -</ul> +</code></pre> @@ -1011,22 +896,19 @@ $ git rebase -i dspace-5.5 Fri, 01 Jul 2016 10:53:00 +0300 https://alanorth.github.io/cgspace-notes/2016-07/ - <h2 id="2016-07-01">2016-07-01</h2> - + <h2 id="20160701">2016-07-01</h2> <ul> <li>Add <code>dc.description.sponsorship</code> to Discovery sidebar facets and make investors clickable in item view (<a href="https://github.com/ilri/DSpace/issues/232">#232</a>)</li> - -<li><p>I think this query should find and replace all authors that have &ldquo;,&rdquo; at the end of their names:</p> - 
+<li>I think this query should find and replace all authors that have &ldquo;,&rdquo; at the end of their names:</li> +</ul> <pre><code>dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$'; UPDATE 95 dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$'; -text_value + text_value ------------ (0 rows) -</code></pre></li> - -<li><p>In this case the select query was showing 95 results before the update</p></li> +</code></pre><ul> +<li>In this case the select query was showing 95 results before the update</li> </ul> @@ -1036,11 +918,10 @@ text_value Wed, 01 Jun 2016 10:53:00 +0300 https://alanorth.github.io/cgspace-notes/2016-06/ - <h2 id="2016-06-01">2016-06-01</h2> - + <h2 id="20160601">2016-06-01</h2> <ul> <li>Experimenting with IFPRI OAI (we want to harvest their publications)</li> -<li>After reading the <a href="https://www.oclc.org/support/services/contentdm/help/server-admin-help/oai-support.en.html">ContentDM documentation</a> I found IFPRI&rsquo;s OAI endpoint: <a href="http://ebrary.ifpri.org/oai/oai.php">http://ebrary.ifpri.org/oai/oai.php</a></li> +<li>After reading the <a href="https://www.oclc.org/support/services/contentdm/help/server-admin-help/oai-support.en.html">ContentDM documentation</a> I found IFPRI's OAI endpoint: <a href="http://ebrary.ifpri.org/oai/oai.php">http://ebrary.ifpri.org/oai/oai.php</a></li> <li>After reading the <a href="https://www.openarchives.org/OAI/openarchivesprotocol.html">OAI documentation</a> and testing with an <a href="http://validator.oaipmh.com/">OAI validator</a> I found out how to get their publications</li> <li>This is their publications set: <a href="http://ebrary.ifpri.org/oai/oai.php?verb=ListRecords&amp;from=2016-01-01&amp;set=p15738coll2&amp;metadataPrefix=oai_dc">http://ebrary.ifpri.org/oai/oai.php?verb=ListRecords&amp;from=2016-01-01&amp;set=p15738coll2&amp;metadataPrefix=oai_dc</a></li> <li>You can see the others by using the OAI <code>ListSets</code> verb: <a href="http://ebrary.ifpri.org/oai/oai.php?verb=ListSets">http://ebrary.ifpri.org/oai/oai.php?verb=ListSets</a></li> @@ -1054,18 +935,15 @@ text_value Sun, 01 May 2016 23:06:00 +0300 https://alanorth.github.io/cgspace-notes/2016-05/ - <h2 id="2016-05-01">2016-05-01</h2> - + <h2 id="20160501">2016-05-01</h2> <ul> <li>Since yesterday there have been 10,000 REST errors and the site has been unstable again</li> <li>I have blocked access to the API now</li> - -<li><p>There are 3,000 IPs accessing the REST API in a 24-hour period!</p> - +<li>There are 3,000 IPs accessing the REST API in a 24-hour period!</li> +</ul> <pre><code># awk '{print $1}' /var/log/nginx/rest.log | uniq | wc -l 3168 -</code></pre></li> -</ul> +</code></pre> @@ -1074,13 +952,12 @@ text_value Mon, 04 Apr 2016 11:06:00 +0300 https://alanorth.github.io/cgspace-notes/2016-04/ - <h2 id="2016-04-04">2016-04-04</h2> - + <h2 id="20160404">2016-04-04</h2> <ul> <li>Looking at log file use on CGSpace and notice that we need to work on our cron setup a bit</li> <li>We are backing up all logs in the log folder, including useless stuff like solr, cocoon, handle-plugin, etc</li> -<li>After running DSpace for over five years I&rsquo;ve never needed to look in any other log file than dspace.log, leave alone one from last year!</li> -<li>This will save us a few gigs of backup space we&rsquo;re paying for on S3</li> +<li>After running DSpace for 
over five years I've never needed to look in any other log file than dspace.log, leave alone one from last year!</li> +<li>This will save us a few gigs of backup space we're paying for on S3</li> <li>Also, I noticed the <code>checker</code> log has some errors we should pay attention to:</li> </ul> @@ -1091,11 +968,10 @@ text_value Wed, 02 Mar 2016 16:50:00 +0300 https://alanorth.github.io/cgspace-notes/2016-03/ - <h2 id="2016-03-02">2016-03-02</h2> - + <h2 id="20160302">2016-03-02</h2> <ul> <li>Looking at issues with author authorities on CGSpace</li> -<li>For some reason we still have the <code>index-lucene-update</code> cron job active on CGSpace, but I&rsquo;m pretty sure we don&rsquo;t need it as of the latest few versions of Atmire&rsquo;s Listings and Reports module</li> +<li>For some reason we still have the <code>index-lucene-update</code> cron job active on CGSpace, but I'm pretty sure we don't need it as of the latest few versions of Atmire's Listings and Reports module</li> <li>Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match environment on CGSpace server</li> </ul> @@ -1106,16 +982,13 @@ text_value Fri, 05 Feb 2016 13:18:00 +0300 https://alanorth.github.io/cgspace-notes/2016-02/ - <h2 id="2016-02-05">2016-02-05</h2> - + <h2 id="20160205">2016-02-05</h2> <ul> <li>Looking at some DAGRIS data for Abenet Yabowork</li> <li>Lots of issues with spaces, newlines, etc causing the import to fail</li> <li>I noticed we have a very <em>interesting</em> list of countries on CGSpace:</li> </ul> - -<p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/02/cgspace-countries.png" alt="CGSpace country list" /></p> - +<p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/02/cgspace-countries.png" alt="CGSpace country list"></p> <ul> <li>Not only are there 49,000 countries, we have some blanks (25)&hellip;</li> <li>Also, lots of things like &ldquo;COTE D`LVOIRE&rdquo; and &ldquo;COTE D IVOIRE&rdquo;</li> @@ -1128,8 +1001,7 @@ text_value Wed, 13 Jan 2016 13:18:00 +0300 https://alanorth.github.io/cgspace-notes/2016-01/ - <h2 id="2016-01-13">2016-01-13</h2> - + <h2 id="20160113">2016-01-13</h2> <ul> <li>Move ILRI collection <code>10568/12503</code> from <code>10568/27869</code> to <code>10568/27629</code> using the <a href="https://gist.github.com/alanorth/392c4660e8b022d99dfa">move_collections.sh</a> script I wrote last year.</li> <li>I realized it is only necessary to clear the Cocoon cache after moving collections—rather than reindexing—as no metadata has changed, and therefore no search or browse indexes need to be updated.</li> @@ -1143,18 +1015,16 @@ text_value Wed, 02 Dec 2015 13:18:00 +0300 https://alanorth.github.io/cgspace-notes/2015-12/ - <h2 id="2015-12-02">2015-12-02</h2> - + <h2 id="20151202">2015-12-02</h2> <ul> -<li><p>Replace <code>lzop</code> with <code>xz</code> in log compression cron jobs on DSpace Test—it uses less space:</p> - +<li>Replace <code>lzop</code> with <code>xz</code> in log compression cron jobs on DSpace Test—it uses less space:</li> +</ul> <pre><code># cd /home/dspacetest.cgiar.org/log # ls -lh dspace.log.2015-11-18* -rw-rw-r-- 1 tomcat7 tomcat7 2.0M Nov 18 23:59 dspace.log.2015-11-18 -rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo -rw-rw-r-- 1 tomcat7 tomcat7 169K Nov 18 23:59 dspace.log.2015-11-18.xz -</code></pre></li> -</ul> +</code></pre> @@ -1163,18 +1033,15 @@ text_value Mon, 23 Nov 2015 17:00:57 +0300 
https://alanorth.github.io/cgspace-notes/2015-11/ - <h2 id="2015-11-22">2015-11-22</h2> - + <h2 id="20151122">2015-11-22</h2> <ul> <li>CGSpace went down</li> <li>Looks like DSpace exhausted its PostgreSQL connection pool</li> - -<li><p>Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:</p> - +<li>Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:</li> +</ul> <pre><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace 78 -</code></pre></li> -</ul> +</code></pre> diff --git a/docs/page/2/index.html b/docs/page/2/index.html index 18ff91195..300bee324 100644 --- a/docs/page/2/index.html +++ b/docs/page/2/index.html @@ -9,13 +9,12 @@ - - + @@ -100,40 +99,34 @@

    -

    2019-02-01

    - +

    2019-02-01

    • Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!
    • - -
    • The top IPs before, during, and after this latest alert tonight were:

      - +
    • The top IPs before, during, and after this latest alert tonight were:
    • +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "01/Feb/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -245 207.46.13.5
    -332 54.70.40.11
    -385 5.143.231.38
    -405 207.46.13.173
    -405 207.46.13.75
    -1117 66.249.66.219
    -1121 35.237.175.180
    -1546 5.9.6.51
    -2474 45.5.186.2
    -5490 85.25.237.71
    -
    - -
  • 85.25.237.71 is the “Linguee Bot” that I first saw last month

  • - -
  • The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase

  • - -
  • There were just over 3 million accesses in the nginx logs last month:

    -
    +  245 207.46.13.5
    +  332 54.70.40.11
    +  385 5.143.231.38
    +  405 207.46.13.173
    +  405 207.46.13.75
    + 1117 66.249.66.219
    + 1121 35.237.175.180
    + 1546 5.9.6.51
    + 2474 45.5.186.2
    + 5490 85.25.237.71
    +
      +
    • 85.25.237.71 is the “Linguee Bot” that I first saw last month
    • +
    • The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase
    • +
    • There were just over 3 million accesses in the nginx logs last month:
    • +
    # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
     3018243
     
     real    0m19.873s
     user    0m22.203s
     sys     0m1.979s
    -
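    • A per-day breakdown can show whether that traffic was spread evenly or concentrated in spikes; a minimal sketch re-using the same grep pattern and log paths as above (not a command that was actually run here):
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -oE "[0-9]{1,2}/Jan/2019" | sort | uniq -c | sort -n | tail -n 5   # five busiest days by request count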
  • - + Read more → @@ -151,26 +144,23 @@ sys 0m1.979s

    -

    2019-01-02

    - +

    2019-01-02

    • Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning
    • - -
    • I don’t see anything interesting in the web server logs around that time though:

      - -
      # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      - 92 40.77.167.4
      - 99 210.7.29.100
      -120 38.126.157.45
      -177 35.237.175.180
      -177 40.77.167.32
      -216 66.249.75.219
      -225 18.203.76.93
      -261 46.101.86.248
      -357 207.46.13.1
      -903 54.70.40.11
      -
    • +
    • I don't see anything interesting in the web server logs around that time though:
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +     92 40.77.167.4
    +     99 210.7.29.100
    +    120 38.126.157.45
    +    177 35.237.175.180
    +    177 40.77.167.32
    +    216 66.249.75.219
    +    225 18.203.76.93
    +    261 46.101.86.248
    +    357 207.46.13.1
    +    903 54.70.40.11
    +
    Read more → @@ -188,16 +178,13 @@ sys 0m1.979s

    -

    2018-12-01

    - +

    2018-12-01

    • Switch CGSpace (linode18) to use OpenJDK instead of Oracle JDK
    • I manually installed OpenJDK, then removed Oracle JDK, then re-ran the Ansible playbook to update all configuration files, etc
    • Then I ran all system updates and restarted the server
    - -

    2018-12-02

    - +

    2018-12-02

    @@ -218,15 +205,12 @@ sys 0m1.979s

    -

    2018-11-01

    - +

    2018-11-01

    • Finalize AReS Phase I and Phase II ToRs
    • Send a note about my dspace-statistics-api to the dspace-tech mailing list
    - -

    2018-11-03

    - +

    2018-11-03

    • Linode has been sending mails a few times a day recently that CGSpace (linode18) has had high CPU usage
    • Today these are the top 10 IPs:
    • @@ -248,11 +232,10 @@ sys 0m1.979s

      -

      2018-10-01

      - +

      2018-10-01

      • Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items
      • -
      • I created a GitHub issue to track this #389, because I’m super busy in Nairobi right now
      • +
      • I created a GitHub issue to track this #389, because I'm super busy in Nairobi right now
      Read more → @@ -271,13 +254,12 @@ sys 0m1.979s

      -

      2018-09-02

      - +

      2018-09-02

      • New PostgreSQL JDBC driver version 42.2.5
      • -
      • I’ll update the DSpace role in our Ansible infrastructure playbooks and run the updated playbooks on CGSpace and DSpace Test
      • -
      • Also, I’ll re-run the postgresql tasks because the custom PostgreSQL variables are dynamic according to the system’s RAM, and we never re-ran them after migrating to larger Linodes last month
      • -
      • I’m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I’m getting those autowire errors in Tomcat 8.5.30 again:
      • +
      • I'll update the DSpace role in our Ansible infrastructure playbooks and run the updated playbooks on CGSpace and DSpace Test
      • +
      • Also, I'll re-run the postgresql tasks because the custom PostgreSQL variables are dynamic according to the system's RAM, and we never re-ran them after migrating to larger Linodes last month
      • +
      • I'm testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I'm getting those autowire errors in Tomcat 8.5.30 again:
      Read more → @@ -296,27 +278,20 @@ sys 0m1.979s

      -

      2018-08-01

      - +

      2018-08-01

        -
      • DSpace Test had crashed at some point yesterday morning and I see the following in dmesg:

        - +
      • DSpace Test had crashed at some point yesterday morning and I see the following in dmesg:
      • +
      [Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
       [Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB
       [Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      -
      - -
    • Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight

    • - -
    • From the DSpace log I see that eventually Solr stopped responding, so I guess the java process that was OOM killed above was Tomcat’s

    • - -
    • I’m not sure why Tomcat didn’t crash with an OutOfMemoryError…

    • - -
    • Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core

    • - -
    • The server only has 8GB of RAM so we’ll eventually need to upgrade to a larger one because we’ll start starving the OS, PostgreSQL, and command line batch processes

    • - -
    • I ran all system updates on DSpace Test and rebooted it

    • +
        +
      • Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight
      • +
      • From the DSpace log I see that eventually Solr stopped responding, so I guess the java process that was OOM killed above was Tomcat's
      • +
      • I'm not sure why Tomcat didn't crash with an OutOfMemoryError…
      • +
      • Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core
      • +
      • The server only has 8GB of RAM so we'll eventually need to upgrade to a larger one because we'll start starving the OS, PostgreSQL, and command line batch processes
      • +
      • I ran all system updates on DSpace Test and rebooted it
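      • For reference, bumping the heap like that is normally just a matter of changing the -Xms/-Xmx values in Tomcat's JVM options; a hypothetical sketch (the file where JAVA_OPTS is set on this server is an assumption):
      JAVA_OPTS="$JAVA_OPTS -Xms6144m -Xmx6144m"   # e.g. in /etc/default/tomcat7 (path assumed); only the heap values are the point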
      Read more → @@ -335,19 +310,16 @@ sys 0m1.979s

      -

      2018-07-01

      - +

      2018-07-01

        -
      • I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:

        - -
        $ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
        -
      • - -
      • During the mvn package stage on the 5.8 branch I kept getting issues with java running out of memory:

        - -
        There is insufficient memory for the Java Runtime Environment to continue.
        -
      • +
      • I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:
      +
      $ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
      +
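      • Should the upgrade go wrong, a custom-format dump like that can be loaded back with pg_restore; a minimal sketch, assuming the dspace database and role already exist:
      $ pg_restore -U dspace -d dspace --clean -O dspace-2018-07-01.backup   # --clean drops existing objects first, -O skips ownership changes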
        +
      • During the mvn package stage on the 5.8 branch I kept getting issues with java running out of memory:
      • +
      +
      There is insufficient memory for the Java Runtime Environment to continue.
      +
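      • Maven reads its JVM settings from the MAVEN_OPTS environment variable, so the usual workaround for that error is to raise the heap before building; a sketch with an assumed size, not necessarily what was used here:
      $ export MAVEN_OPTS="-Xmx1024m"   # assumed value; raise until the build stops running out of memory
      $ mvn package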
      Read more → @@ -365,32 +337,27 @@ sys 0m1.979s

      -

      2018-06-04

      - +

      2018-06-04

      • Test the DSpace 5.8 module upgrades from Atmire (#378) -
          -
        • There seems to be a problem with the CUA and L&R versions in pom.xml because they are using SNAPSHOT and it doesn’t build
        • -
      • +
      • There seems to be a problem with the CUA and L&R versions in pom.xml because they are using SNAPSHOT and it doesn't build
      • +
      +
    • I added the new CCAFS Phase II Project Tag PII-FP1_PACCA2 and merged it into the 5_x-prod branch (#379)
    • - -
    • I proofed and tested the ILRI author corrections that Peter sent back to me this week:

      - +
    • I proofed and tested the ILRI author corrections that Peter sent back to me this week:
    • +
    $ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
    -
    - -
  • I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in March, 2018

  • - -
  • Time to index ~70,000 items on CGSpace:

    - +
      +
    • I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in March, 2018
    • +
    • Time to index ~70,000 items on CGSpace:
    • +
    $ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b                                  
     
     real    74m42.646s
     user    8m5.056s
     sys     2m7.289s
    -
  • - + Read more → @@ -408,15 +375,14 @@ sys 2m7.289s

    -

    2018-05-01

    - +

    2018-05-01

    +
  • Then I reduced the JVM heap size from 6144 back to 5120m
  • Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked the Ansible infrastructure scripts to support hosts choosing which distribution they want to use
  • diff --git a/docs/page/3/index.html b/docs/page/3/index.html index 33c840ae3..dd4619085 100644 --- a/docs/page/3/index.html +++ b/docs/page/3/index.html @@ -9,13 +9,12 @@ - - + @@ -100,10 +99,9 @@

    -

    2018-04-01

    - +

    2018-04-01

      -
    • I tried to test something on DSpace Test but noticed that it’s down since god knows when
    • +
    • I tried to test something on DSpace Test but noticed that it's down since god knows when
    • Catalina logs at least show some memory errors yesterday:
    Read more → @@ -123,8 +121,7 @@

    -

    2018-03-02

    - +

    2018-03-02

    • Export a CSV of the IITA community metadata for Martin Mueller
    @@ -145,13 +142,12 @@

    -

    2018-02-01

    - +

    2018-02-01

    • Peter gave feedback on the dc.rights proof of concept that I had sent him last week
    • -
    • We don’t need to distinguish between internal and external works, so that makes it just a simple list
    • +
    • We don't need to distinguish between internal and external works, so that makes it just a simple list
    • Yesterday I figured out how to monitor DSpace sessions using JMX
    • -
    • I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu’s munin-plugins-java package and used the stuff I discovered about JMX in 2018-01
    • +
    • I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu's munin-plugins-java package and used the stuff I discovered about JMX in 2018-01
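    • Those munin JMX plugins need Tomcat to expose a JMX port in the first place; a rough sketch of the kind of CATALINA_OPTS that does that (the port number and the disabled auth/SSL are placeholder choices for a local-only setup, not the actual CGSpace configuration):
    CATALINA_OPTS="$CATALINA_OPTS -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9010 -Dcom.sun.management.jmxremote.local.only=true -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"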
    Read more → @@ -170,33 +166,26 @@

    -

    2018-01-02

    - +

    2018-01-02

    • Uptime Robot noticed that CGSpace went down and up a few times last night, for a few minutes each time
    • -
    • I didn’t get any load alerts from Linode and the REST and XMLUI logs don’t show anything out of the ordinary
    • +
    • I didn't get any load alerts from Linode and the REST and XMLUI logs don't show anything out of the ordinary
    • The nginx logs show HTTP 200s until 02/Jan/2018:11:27:17 +0000 when Uptime Robot got an HTTP 500
    • In dspace.log around that time I see many errors like “Client closed the connection before file download was complete”
    • - -
    • And just before that I see this:

      - +
    • And just before that I see this:
    • +
    Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
    -
    - -
  • Ah hah! So the pool was actually empty!

  • - -
  • I need to increase that, let’s try to bump it up from 50 to 75

  • - -
  • After that one client got an HTTP 499 but then the rest were HTTP 200, so I don’t know what the hell Uptime Robot saw

  • - -
  • I notice this error quite a few times in dspace.log:

    - +
      +
    • Ah hah! So the pool was actually empty!
    • +
    • I need to increase that, let's try to bump it up from 50 to 75
    • +
    • After that one client got an HTTP 499 but then the rest were HTTP 200, so I don't know what the hell Uptime Robot saw
    • +
    • I notice this error quite a few times in dspace.log:
    • +
    2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
     org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered " "]" "] "" at line 1, column 32.
    -
  • - -
  • And there are many of these errors every day for the past month:

    - +
      +
    • And there are many of these errors every day for the past month:
    • +
    $ grep -c "Error while searching for sidebar facets" dspace.log.*
     dspace.log.2017-11-21:4
     dspace.log.2017-11-22:1
    @@ -241,9 +230,8 @@ dspace.log.2017-12-30:89
     dspace.log.2017-12-31:53
     dspace.log.2018-01-01:45
     dspace.log.2018-01-02:34
    -
  • - -
  • Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let’s Encrypt if it’s just a handful of domains

  • +
      +
    • Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let's Encrypt if it's just a handful of domains
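    • If they do go the Let's Encrypt route, certbot handles a short list of explicit domains easily; a hypothetical invocation (the hostnames and the use of the nginx plugin are assumptions):
    # certbot certonly --nginx -d ilri.org -d www.ilri.org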
    Read more → @@ -262,8 +250,7 @@ dspace.log.2018-01-02:34

    -

    2017-12-01

    - +

    2017-12-01

    • Uptime Robot noticed that CGSpace went down
    • The logs say “Timeout waiting for idle object”
    • @@ -287,27 +274,22 @@ dspace.log.2018-01-02:34

      -

      2017-11-01

      - +

      2017-11-01

      • The CORE developers responded to say they are looking into their bot not respecting our robots.txt
      - -

      2017-11-02

      - +

      2017-11-02

        -
      • Today there have been no hits by CORE and no alerts from Linode (coincidence?)

        - +
      • Today there have been no hits by CORE and no alerts from Linode (coincidence?)
      • +
      # grep -c "CORE" /var/log/nginx/access.log
       0
      -
      - -
    • Generate list of authors on CGSpace for Peter to go through and correct:

      - +
        +
      • Generate list of authors on CGSpace for Peter to go through and correct:
      • +
      dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
       COPY 54701
      -
    • -
    + Read more → @@ -325,17 +307,14 @@ COPY 54701

    -

    2017-10-01

    - +

    2017-10-01

    http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
    -
    - -
  • There appears to be a pattern but I’ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine

  • - -
  • Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections

  • +
      +
    • There appears to be a pattern but I'll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
    • +
    • Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
    Read more → @@ -374,16 +353,13 @@ COPY 54701

    -

    2017-09-06

    - +

    2017-09-06

    • Linode sent an alert that CGSpace (linode18) was using 261% CPU for the past two hours
    - -

    2017-09-07

    - +

    2017-09-07

      -
    • Ask Sisay to clean up the WLE approvers a bit, as Marianne’s user account is both in the approvers step as well as the group
    • +
    • Ask Sisay to clean up the WLE approvers a bit, as Marianne's user account is both in the approvers step as well as the group
    Read more → @@ -402,22 +378,21 @@ COPY 54701

    -

    2017-08-01

    - +

    2017-08-01

    • Linode sent an alert that CGSpace (linode18) was using 350% CPU for the past two hours
    • I looked in the Activity pane of the Admin Control Panel and it seems that Google, Baidu, Yahoo, and Bing are all crawling with massive numbers of bots concurrently (~100 total, mostly Baidu and Google)
    • The good thing is that, according to dspace.log.2017-08-01, they are all using the same Tomcat session
    • This means our Tomcat Crawler Session Valve is working
    • But many of the bots are browsing dynamic URLs like: -
      • /handle/10568/3353/discover
      • /handle/10568/16510/browse
      • -
    • +
    +
  • The robots.txt only blocks the top-level /discover and /browse URLs… we will need to find a way to forbid them from accessing these!
  • Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
  • -
  • It turns out that we’re already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
  • +
  • It turns out that we're already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
  • Also, the bot has to successfully browse the page first so it can receive the HTTP header…
  • We might actually have to block these requests with HTTP 403 depending on the user agent
  • Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415
  • diff --git a/docs/page/4/index.html b/docs/page/4/index.html index d6cbf41cb..53fc2ec7d 100644 --- a/docs/page/4/index.html +++ b/docs/page/4/index.html @@ -9,13 +9,12 @@ - - + @@ -100,18 +99,15 @@

    -

    2017-07-01

    - +

    2017-07-01

    • Run system updates and reboot DSpace Test
    - -

    2017-07-04

    - +

    2017-07-04

    • Merge changes for WLE Phase II theme rename (#329)
    • -
    • Looking at extracting the metadata registries from ICARDA’s MEL DSpace database so we can compare fields with CGSpace
    • -
    • We can use PostgreSQL’s extended output format (-x) plus sed to format the output into quasi XML:
    • +
    • Looking at extracting the metadata registries from ICARDA's MEL DSpace database so we can compare fields with CGSpace
    • +
    • We can use PostgreSQL's extended output format (-x) plus sed to format the output into quasi XML:
    Read more → @@ -130,7 +126,7 @@

    - 2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we’ll create a new sub-community for Phase II and create collections for the research themes there The current “Research Themes” community will be renamed to “WLE Phase I Research Themes” Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg. + 2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we'll create a new sub-community for Phase II and create collections for the research themes there The current “Research Themes” community will be renamed to “WLE Phase I Research Themes” Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg. Read more → @@ -148,7 +144,7 @@

    - 2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it’s a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire’s CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace. + 2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it's a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire's CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace. Read more → @@ -166,23 +162,18 @@

    -

    2017-04-02

    - +

    2017-04-02

    • Merge one change to CCAFS flagships that I had forgotten to remove last month (“MANAGING CLIMATE RISK”): https://github.com/ilri/DSpace/pull/317
    • Quick proof-of-concept hack to add dc.rights to the input form, including some inline instructions/hints:
    - -

    dc.rights in the submission form

    - +

    dc.rights in the submission form

    • Remove redundant/duplicate text in the DSpace submission license
    • - -
    • Testing the CMYK patch on a collection with 650 items:

      - -
      $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt
      -
    • +
    • Testing the CMYK patch on a collection with 650 items:
    +
    $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt
    +
    Read more → @@ -200,14 +191,11 @@

    -

    2017-03-01

    - +

    2017-03-01

    • Run the 279 CIAT author corrections on CGSpace
    - -

    2017-03-02

    - +

    2017-03-02

    • Skype with Michael and Peter, discussing moving the CGIAR Library to CGSpace
    • CGIAR people possibly open to moving content, redirecting library.cgiar.org to CGSpace and letting CGSpace resolve their handles
    • @@ -217,13 +205,11 @@
    • Also, need to consider talking to Atmire about hiring them to bring ORCiD metadata to REST / OAI
    • Filed an issue on DSpace issue tracker for the filter-media bug that causes it to process JPGs even when limiting to the PDF thumbnail plugin: DS-3516
    • Discovered that the ImageMagick filter-media plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK
    • - -
    • Interestingly, it seems DSpace 4.x’s thumbnails were sRGB, but forcing regeneration using DSpace 5.x’s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 1056851999):

      - +
    • Interestingly, it seems DSpace 4.x's thumbnails were sRGB, but forcing regeneration using DSpace 5.x's ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 10568/51999):
    • +
    $ identify ~/Desktop/alc_contrastes_desafios.jpg
     /Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
    -
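    • When checking a batch of thumbnails it is handy to print just the colorspace rather than the full identify output; a small sketch using ImageMagick's format escapes on the same file:
    $ identify -format "%[colorspace]\n" ~/Desktop/alc_contrastes_desafios.jpg   # prints CMYK here, sRGB for a good thumbnail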
    - + Read more → @@ -241,25 +227,22 @@

    -

    2017-02-07

    - +

    2017-02-07

      -
    • An item was mapped twice erroneously again, so I had to remove one of the mappings manually:

      - +
    • An item was mapped twice erroneously again, so I had to remove one of the mappings manually:
    • +
    dspace=# select * from collection2item where item_id = '80278';
    -id   | collection_id | item_id
    +  id   | collection_id | item_id
     -------+---------------+---------
    -92551 |           313 |   80278
    -92550 |           313 |   80278
    -90774 |          1051 |   80278
    + 92551 |           313 |   80278
    + 92550 |           313 |   80278
    + 90774 |          1051 |   80278
     (3 rows)
     dspace=# delete from collection2item where id = 92551 and item_id = 80278;
     DELETE 1
    -
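    • The delete above only removes one known duplicate; a query along these lines should list every item that is mapped to the same collection more than once (a sketch, not something that was actually run here):
    dspace=# select item_id, collection_id, count(*) from collection2item group by item_id, collection_id having count(*) > 1;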
    - -
  • Create issue on GitHub to track the addition of CCAFS Phase II project tags (#301)

  • - -
  • Looks like we’ll be using cg.identifier.ccafsprojectpii as the field name

  • +
      +
    • Create issue on GitHub to track the addition of CCAFS Phase II project tags (#301)
    • +
    • Looks like we'll be using cg.identifier.ccafsprojectpii as the field name
    Read more → @@ -278,12 +261,11 @@ DELETE 1

    -

    2017-01-02

    - +

    2017-01-02

    • I checked to see if the Solr sharding task that is supposed to run on January 1st had run and saw there was an error
    • -
    • I tested on DSpace Test as well and it doesn’t work there either
    • -
    • I asked on the dspace-tech mailing list because it seems to be broken, and actually now I’m not sure if we’ve ever had the sharding task run successfully over all these years
    • +
    • I tested on DSpace Test as well and it doesn't work there either
    • +
    • I asked on the dspace-tech mailing list because it seems to be broken, and actually now I'm not sure if we've ever had the sharding task run successfully over all these years
    Read more → @@ -302,25 +284,20 @@ DELETE 1

    -

    2016-12-02

    - +

    2016-12-02

    • CGSpace was down for five hours in the morning while I was sleeping
    • - -
    • While looking in the logs for errors, I see tons of warnings about Atmire MQM:

      - +
    • While looking in the logs for errors, I see tons of warnings about Atmire MQM:
    • +
    2016-12-02 03:00:32,352 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail="dc.title", transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail="THUMBNAIL", transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
    -
    - -
  • I see thousands of them in the logs for the last few months, so it’s not related to the DSpace 5.5 upgrade

  • - -
  • I’ve raised a ticket with Atmire to ask

  • - -
  • Another worrying error from dspace.log is:

  • +
      +
    • I see thousands of them in the logs for the last few months, so it's not related to the DSpace 5.5 upgrade
    • +
    • I've raised a ticket with Atmire to ask
    • +
    • Another worrying error from dspace.log is:
    Read more → @@ -339,13 +316,11 @@ DELETE 1

    -

    2016-11-01

    - +

    2016-11-01

      -
    • Add dc.type to the output options for Atmire’s Listings and Reports module (#286)
    • +
    • Add dc.type to the output options for Atmire's Listings and Reports module (#286)
    - -

    Listings and Reports with output type

    +

    Listings and Reports with output type

    Read more → @@ -363,22 +338,19 @@ DELETE 1

    -

    2016-10-03

    - +

    2016-10-03

    • Testing adding ORCIDs to a CSV file for a single item to see if the author orders get messed up
    • Need to test the following scenarios to see how author order is affected: -
      • ORCIDs only
      • ORCIDs plus normal authors
      • -
    • - -
    • I exported a random item’s metadata as CSV, deleted all columns except id and collection, and made a new column called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:

      - -
      0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
      -
    + +
  • I exported a random item's metadata as CSV, deleted all columns except id and collection, and made a new column called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:
  • + +
    0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
    +
    Read more → diff --git a/docs/page/5/index.html b/docs/page/5/index.html index 22990b9ef..25550f54a 100644 --- a/docs/page/5/index.html +++ b/docs/page/5/index.html @@ -9,13 +9,12 @@ - - + @@ -100,18 +99,15 @@

    -

    2016-09-01

    - +

    2016-09-01

    • Discuss helping CCAFS with some batch tagging of ORCID IDs for their authors
    • -
    • Discuss how the migration of CGIAR’s Active Directory to a flat structure will break our LDAP groups in DSpace
    • +
    • Discuss how the migration of CGIAR's Active Directory to a flat structure will break our LDAP groups in DSpace
    • We had been using DC=ILRI to determine whether a user was ILRI or not
    • - -
    • It looks like we might be able to use OUs now, instead of DCs:

      - -
      $ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=org" -D "admigration1@cgiarad.org" -W "(sAMAccountName=admigration1)"
      -
    • +
    • It looks like we might be able to use OUs now, instead of DCs:
    +
    $ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=org" -D "admigration1@cgiarad.org" -W "(sAMAccountName=admigration1)"
    +
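    • If the OUs really are usable, the simplest test is probably to scope the search base to one OU and see whether the account still comes back; a sketch with a placeholder OU (ou=SomeOffice is not a real value from this directory):
    $ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "ou=SomeOffice,dc=cgiarad,dc=org" -D "admigration1@cgiarad.org" -W "(sAMAccountName=admigration1)" dn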
    Read more → @@ -129,22 +125,19 @@

    -

    2016-08-01

    - +

    2016-08-01

    • Add updated distribution license from Sisay (#259)
    • Play with upgrading Mirage 2 dependencies in bower.json because most are several versions out of date
    • Bootstrap is at 3.3.0 but upstream is at 3.3.7, and upgrading to anything beyond 3.3.1 breaks glyphicons and probably more
    • bower stuff is a dead end, waste of time, too many issues
    • Anything after Bootstrap 3.3.1 makes glyphicons disappear (HTTP 404 trying to access from incorrect path of fonts)
    • - -
    • Start working on DSpace 5.1 → 5.5 port:

      - +
    • Start working on DSpace 5.1 → 5.5 port:
    • +
    $ git checkout -b 55new 5_x-prod
     $ git reset --hard ilri/5_x-prod
     $ git rebase -i dspace-5.5
    -
    - + Read more → @@ -162,22 +155,19 @@ $ git rebase -i dspace-5.5

    -

    2016-07-01

    - +

    2016-07-01

    • Add dc.description.sponsorship to Discovery sidebar facets and make investors clickable in item view (#232)
    • - -
    • I think this query should find and replace all authors that have “,” at the end of their names:

      - +
    • I think this query should find and replace all authors that have “,” at the end of their names:
    • +
    dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
     UPDATE 95
     dspacetest=# select text_value from  metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
    -text_value
    + text_value
     ------------
     (0 rows)
    -
    - -
  • In this case the select query was showing 95 results before the update

  • +
      +
    • In this case the select query was showing 95 results before the update
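    • A safe way to sanity check a regexp_replace like this before running the UPDATE is to select the current and proposed values side by side; a minimal sketch using the same pattern:
    dspacetest=# select text_value, regexp_replace(text_value, '(^.+?),$', '\1') as proposed from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$' limit 10;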
    Read more → @@ -196,11 +186,10 @@ text_value

    2016-06-01

    Read more → @@ -252,13 +238,12 @@

    2016-04-04

    • Looking at log file use on CGSpace and notice that we need to work on our cron setup a bit
    • We are backing up all logs in the log folder, including useless stuff like solr, cocoon, handle-plugin, etc (a more selective backup is sketched below)
    • After running DSpace for over five years I've never needed to look in any other log file than dspace.log, let alone one from last year!
    • This will save us a few gigs of backup space we're paying for on S3
    • Also, I noticed the checker log has some errors we should pay attention to:
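    • A minimal sketch of a more selective log backup that only picks up the dspace.log files, assuming the S3 upload happens elsewhere in the cron job (the paths and 30-day window are hypothetical, not from the original notes):

    $ find [dspace]/log -name 'dspace.log.*' -mtime -30 | tar -czf /tmp/dspace-logs-$(date +%F).tar.gz -T -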
    Read more → @@ -278,11 +263,10 @@ text_value

    2016-03-02

    • Looking at issues with author authorities on CGSpace
    • For some reason we still have the index-lucene-update cron job active on CGSpace, but I'm pretty sure we don't need it as of the latest few versions of Atmire's Listings and Reports module
    • Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match the environment on the CGSpace server
    Read more → @@ -302,16 +286,13 @@ text_value

    2016-02-05

    • Looking at some DAGRIS data for Abenet Yabowork
    • Lots of issues with spaces, newlines, etc causing the import to fail
    • I noticed we have a very interesting list of countries on CGSpace:

    CGSpace country list

    • Not only are there 49,000 countries, we have some blanks (25)…
    • Also, lots of things like “COTE D`LVOIRE” and “COTE D IVOIRE”

    @@ -333,8 +314,7 @@

      2016-01-13

      • Move ILRI collection 10568/12503 from 10568/27869 to 10568/27629 using the move_collections.sh script I wrote last year.
      • I realized it is only necessary to clear the Cocoon cache after moving collections—rather than reindexing—as no metadata has changed, and therefore no search or browse indexes need to be updated.

      @@ -357,18 +337,16 @@

        2015-12-02

        • Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less space (a hedged cron sketch follows below):

        # cd /home/dspacetest.cgiar.org/log
        # ls -lh dspace.log.2015-11-18*
        -rw-rw-r-- 1 tomcat7 tomcat7 2.0M Nov 18 23:59 dspace.log.2015-11-18
        -rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo
        -rw-rw-r-- 1 tomcat7 tomcat7 169K Nov 18 23:59 dspace.log.2015-11-18.xz
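        • A minimal sketch of what the xz step in such a cron job might look like, assuming we only compress day-old logs and skip files that are already compressed (the schedule and filename patterns are assumptions):

        # find /home/dspacetest.cgiar.org/log -name 'dspace.log.2*' ! -name '*.xz' ! -name '*.lzo' -mtime +1 -exec xz {} \;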
      + Read more → diff --git a/docs/page/6/index.html b/docs/page/6/index.html index 90c1cb99c..be4368797 100644 --- a/docs/page/6/index.html +++ b/docs/page/6/index.html @@ -9,13 +9,12 @@ - - + @@ -100,18 +99,15 @@

      2015-11-22

      • CGSpace went down
      • Looks like DSpace exhausted its PostgreSQL connection pool
      • Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections (a hedged pool-size check is sketched below):

      $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
      78
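      • A minimal sketch of how to confirm the configured pool limit, assuming it is set via db.maxconnections in dspace.cfg (if the pool is defined in Tomcat's JNDI configuration instead, check server.xml):

      $ grep db.maxconnections [dspace]/config/dspace.cfg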
    + Read more → diff --git a/docs/posts/index.html b/docs/posts/index.html index 4c27ded2f..d0c346c9e 100644 --- a/docs/posts/index.html +++ b/docs/posts/index.html @@ -9,13 +9,12 @@ - - + @@ -100,31 +99,27 @@

    2019-11-04

    • Peter noticed that there were 5.2 million hits on CGSpace in 2019-10 according to the Atmire usage statistics
      • I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 million in the API logs:

    # zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
    4671942
    # zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
    1277694

    • So 4.6 million from XMLUI and another 1.2 million from API requests
    • Let's see how many of the REST API requests were for bitstreams (because they are counted in Solr stats); a quick share calculation follows below:

    # zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
    1183456
    # zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
    106781
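    • A quick back-of-the-envelope share calculation using the two counts above (bc arithmetic only; no new data):

    # echo "scale=4; 106781 / 1183456 * 100" | bc
    9.0200

    • So roughly 9% of the REST API requests were for bitstreams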
  • - + Read more → @@ -145,7 +140,6 @@

    Possible changes to CGSpace metadata fields to align more with DC, QDC, and DCTERMS as well as CG Core v2.


    With reference to CG Core v2 draft standard by Marie-Angélique as well as DCMI DCTERMS.

    Read more → @@ -164,8 +158,7 @@

    2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script's “unneccesary Unicode” fix: $ csvcut -c 'id,dc. Read more → @@ -183,37 +176,34 @@
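    • A minimal sketch of how the non-breaking spaces could be counted before fixing them by hand, assuming the export is saved to a hypothetical /tmp/iwmi.csv (U+00A0 is the byte sequence 0xC2 0xA0 in UTF-8):

    $ grep -c $'\xc2\xa0' /tmp/iwmi.csv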

    2019-09-01

    • Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning
    • Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:

    # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
        440 17.58.101.255
        441 157.55.39.101
        485 207.46.13.43
        728 169.60.128.125
        730 207.46.13.108
        758 157.55.39.9
        808 66.160.140.179
        814 207.46.13.212
       2472 163.172.71.23
       6092 3.94.211.189
    # zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         33 2a01:7e00::f03c:91ff:fe16:fcb
         57 3.83.192.124
         57 3.87.77.25
         57 54.82.1.8
        822 2a01:9cc0:47:1:1a:4:0:2
       1223 45.5.184.72
       1633 172.104.229.92
       5112 205.186.128.185
       7249 2a01:7e00::f03c:91ff:fe18:7396
       9124 45.5.186.2
    Read more → @@ -231,22 +221,19 @@

    2019-08-03

    • Look at Bioversity's latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name…

    2019-08-04

    • Deploy ORCID identifier updates requested by Bioversity to CGSpace
    • Run system updates on CGSpace (linode18) and reboot it
      • Before updating it I checked Solr and verified that all statistics cores were loaded properly (a hedged core status check is sketched below)…
      • After rebooting, all statistics cores were loaded… wow, that's lucky.
    • Run system updates on DSpace Test (linode19) and reboot it
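    • A minimal sketch of how the loaded cores could be listed with Solr's core admin API (the localhost:8081 port and the python pretty-printing are assumptions, not from the original notes):

    $ curl -s 'http://localhost:8081/solr/admin/cores?action=STATUS&wt=json' | python -m json.tool | grep '"name"'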
  • Read more → @@ -266,16 +253,15 @@

    2019-07-01

    • Create an “AfricaRice books and book chapters” collection on CGSpace for AfricaRice
    • Last month Sisay asked why the following “most popular” statistics link for a range of months in 2018 works for the CIAT community on DSpace Test, but not on CGSpace:
    • Abenet had another similar issue a few days ago when trying to find the stats for 2018 in the RTB community
  • Read more → @@ -295,15 +281,12 @@

    2019-06-02

    2019-06-03

    • Skype with Marie-Angélique and Abenet about CG Core v2
    @@ -324,24 +307,21 @@

    2019-05-01

    • Help CCAFS with regenerating some item thumbnails after they uploaded new PDFs to some items on CGSpace
    • A user on the dspace-tech mailing list offered some suggestions for troubleshooting the problem with the inability to delete certain items
      • Apparently if the item is in the workflowitem table it is submitted to a workflow
      • And if it is in the workspaceitem table it is in the pre-submitted state
    • The item seems to be in a pre-submitted state, so I tried to delete it from there (a hedged workflowitem check follows below):

    dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
    DELETE 1

    • But after this I tried to delete the item from the XMLUI and it is still present…
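    • A minimal sketch of how to check whether the item is also stuck in the workflow table, following the mailing-list suggestion above (the psql invocation is an assumption; query the dspace database however is convenient):

    $ psql -U dspace dspace -c 'SELECT * FROM workflowitem WHERE item_id=74648;'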
    Read more → @@ -360,35 +340,30 @@ DELETE 1

    2019-04-01

    • Meeting with AgroKnow to discuss CGSpace, ILRI data, AReS, GARDIAN, etc
      • They asked if we had plans to enable RDF support in CGSpace
    • There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today
      • I suspected that some might not be successful, because the stats show less, but today they were all HTTP 200!

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
       4432 200

    • In the last two weeks there have been 47,000 downloads of this same exact PDF by these three IP addresses
    • Apply country and region corrections and deletions on DSpace Test and CGSpace:

    $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
    $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
    $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
    $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
  • - + Read more → @@ -406,20 +381,19 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace

    2019-03-01

    • I checked IITA's 259 Feb 14 records from last month for duplicates using Atmire's Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good
    • I am now only waiting to hear from her about where the items should go, though I assume Journal Articles go to IITA Journal Articles collection, etc…
    • Looking at the other half of Udana's WLE records from 2018-11
      • I finished the ones for Restoring Degraded Landscapes (RDL), but these are for Variability, Risks and Competing Uses (VRC)
      • I did the usual cleanups for whitespace, added regions where they made sense for certain countries, cleaned up the DOI link formats, added rights information based on the publications page for a few items
      • Most worryingly, there are encoding errors in the abstracts for eleven items (a hedged way to spot them is sketched below), for example:
      • 68.15% � 9.45 instead of 68.15% ± 9.45
      • 2003�2013 instead of 2003–2013
    • I think I will need to ask Udana to re-copy and paste the abstracts with more care using Google Docs
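    • A minimal sketch of how the garbled abstracts could be spotted in a metadata export by counting Unicode replacement characters (U+FFFD, bytes 0xEF 0xBF 0xBD in UTF-8); the /tmp/wle.csv path is hypothetical:

    $ grep -c $'\xef\xbf\xbd' /tmp/wle.csv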
  • Read more → diff --git a/docs/posts/index.xml b/docs/posts/index.xml index fef3b1ce4..90f990bf7 100644 --- a/docs/posts/index.xml +++ b/docs/posts/index.xml @@ -17,31 +17,27 @@ Mon, 04 Nov 2019 12:20:30 +0200 https://alanorth.github.io/cgspace-notes/2019-11/ - <h2 id="2019-11-04">2019-11-04</h2> - + <h2 id="20191104">2019-11-04</h2> <ul> -<li><p>Peter noticed that there were 5.2 million hits on CGSpace in 2019-10 according to the Atmire usage statistics</p> - +<li>Peter noticed that there were 5.2 million hits on CGSpace in 2019-10 according to the Atmire usage statistics <ul> -<li><p>I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 million in the API logs:</p> - +<li>I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 million in the API logs:</li> +</ul> +</li> +</ul> <pre><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot; 4671942 # zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot; 1277694 -</code></pre></li> -</ul></li> - -<li><p>So 4.6 million from XMLUI and another 1.2 million from API requests</p></li> - -<li><p>Let&rsquo;s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</p> - +</code></pre><ul> +<li>So 4.6 million from XMLUI and another 1.2 million from API requests</li> +<li>Let's see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</li> +</ul> <pre><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &quot;[0-9]{1,2}/Oct/2019&quot; 1183456 # zcat --force /var/log/nginx/rest.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep -c -E &quot;/rest/bitstreams&quot; 106781 -</code></pre></li> -</ul> +</code></pre> @@ -51,7 +47,6 @@ https://alanorth.github.io/cgspace-notes/cgspace-cgcorev2-migration/ <p>Possible changes to CGSpace metadata fields to align more with DC, QDC, and DCTERMS as well as CG Core v2.</p> - <p>With reference to <a href="https://agriculturalsemantics.github.io/cg-core/cgcore.html">CG Core v2 draft standard</a> by Marie-Angélique as well as <a href="http://www.dublincore.org/specifications/dublin-core/dcmi-terms/">DCMI DCTERMS</a>.</p> @@ -61,8 +56,7 @@ Tue, 01 Oct 2019 13:20:51 +0300 https://alanorth.github.io/cgspace-notes/2019-10/ - 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace - I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script&rsquo;s &ldquo;unneccesary Unicode&rdquo; fix: + 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script's &ldquo;unneccesary Unicode&rdquo; fix: $ csvcut -c 'id,dc. 
@@ -71,37 +65,34 @@ Sun, 01 Sep 2019 10:17:51 +0300 https://alanorth.github.io/cgspace-notes/2019-09/ - <h2 id="2019-09-01">2019-09-01</h2> - + <h2 id="20190901">2019-09-01</h2> <ul> <li>Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning</li> - -<li><p>Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:</p> - +<li>Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:</li> +</ul> <pre><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 -440 17.58.101.255 -441 157.55.39.101 -485 207.46.13.43 -728 169.60.128.125 -730 207.46.13.108 -758 157.55.39.9 -808 66.160.140.179 -814 207.46.13.212 -2472 163.172.71.23 -6092 3.94.211.189 + 440 17.58.101.255 + 441 157.55.39.101 + 485 207.46.13.43 + 728 169.60.128.125 + 730 207.46.13.108 + 758 157.55.39.9 + 808 66.160.140.179 + 814 207.46.13.212 + 2472 163.172.71.23 + 6092 3.94.211.189 # zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 - 33 2a01:7e00::f03c:91ff:fe16:fcb - 57 3.83.192.124 - 57 3.87.77.25 - 57 54.82.1.8 -822 2a01:9cc0:47:1:1a:4:0:2 -1223 45.5.184.72 -1633 172.104.229.92 -5112 205.186.128.185 -7249 2a01:7e00::f03c:91ff:fe18:7396 -9124 45.5.186.2 -</code></pre></li> -</ul> + 33 2a01:7e00::f03c:91ff:fe16:fcb + 57 3.83.192.124 + 57 3.87.77.25 + 57 54.82.1.8 + 822 2a01:9cc0:47:1:1a:4:0:2 + 1223 45.5.184.72 + 1633 172.104.229.92 + 5112 205.186.128.185 + 7249 2a01:7e00::f03c:91ff:fe18:7396 + 9124 45.5.186.2 +</code></pre> @@ -110,22 +101,19 @@ Sat, 03 Aug 2019 12:39:51 +0300 https://alanorth.github.io/cgspace-notes/2019-08/ - <h2 id="2019-08-03">2019-08-03</h2> - + <h2 id="20190803">2019-08-03</h2> <ul> -<li>Look at Bioversity&rsquo;s latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name&hellip;</li> +<li>Look at Bioversity's latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name&hellip;</li> </ul> - -<h2 id="2019-08-04">2019-08-04</h2> - +<h2 id="20190804">2019-08-04</h2> <ul> <li>Deploy ORCID identifier updates requested by Bioversity to CGSpace</li> <li>Run system updates on CGSpace (linode18) and reboot it - <ul> <li>Before updating it I checked Solr and verified that all statistics cores were loaded properly&hellip;</li> -<li>After rebooting, all statistics cores were loaded&hellip; wow, that&rsquo;s lucky.</li> -</ul></li> +<li>After rebooting, all statistics cores were loaded&hellip; wow, that's lucky.</li> +</ul> +</li> <li>Run system updates on DSpace Test (linode19) and reboot it</li> </ul> @@ -136,16 +124,15 @@ Mon, 01 Jul 2019 12:13:51 +0300 https://alanorth.github.io/cgspace-notes/2019-07/ - <h2 id="2019-07-01">2019-07-01</h2> - + <h2 id="20190701">2019-07-01</h2> <ul> <li>Create an &ldquo;AfricaRice books and book chapters&rdquo; collection on CGSpace for AfricaRice</li> <li>Last month Sisay asked why the following &ldquo;most popular&rdquo; statistics link for a range of months in 2018 works for the CIAT community on DSpace Test, but not on CGSpace: - <ul> <li><a 
href="https://dspacetest.cgiar.org/handle/10568/35697/most-popular/item#simplefilter=custom&amp;time_filter_end_date=01%2F12%2F2018">DSpace Test</a></li> <li><a href="https://cgspace.cgiar.org/handle/10568/35697/most-popular/item#simplefilter=custom&amp;time_filter_end_date=01%2F12%2F2018">CGSpace</a></li> -</ul></li> +</ul> +</li> <li>Abenet had another similar issue a few days ago when trying to find the stats for 2018 in the RTB community</li> </ul> @@ -156,15 +143,12 @@ Sun, 02 Jun 2019 10:57:51 +0300 https://alanorth.github.io/cgspace-notes/2019-06/ - <h2 id="2019-06-02">2019-06-02</h2> - + <h2 id="20190602">2019-06-02</h2> <ul> <li>Merge the <a href="https://github.com/ilri/DSpace/pull/425">Solr filterCache</a> and <a href="https://github.com/ilri/DSpace/pull/426">XMLUI ISI journal</a> changes to the <code>5_x-prod</code> branch and deploy on CGSpace</li> <li>Run system updates on CGSpace (linode18) and reboot it</li> </ul> - -<h2 id="2019-06-03">2019-06-03</h2> - +<h2 id="20190603">2019-06-03</h2> <ul> <li>Skype with Marie-Angélique and Abenet about <a href="https://agriculturalsemantics.github.io/cg-core/cgcore.html">CG Core v2</a></li> </ul> @@ -176,24 +160,21 @@ Wed, 01 May 2019 07:37:43 +0300 https://alanorth.github.io/cgspace-notes/2019-05/ - <h2 id="2019-05-01">2019-05-01</h2> - + <h2 id="20190501">2019-05-01</h2> <ul> <li>Help CCAFS with regenerating some item thumbnails after they uploaded new PDFs to some items on CGSpace</li> <li>A user on the dspace-tech mailing list offered some suggestions for troubleshooting the problem with the inability to delete certain items - <ul> <li>Apparently if the item is in the <code>workflowitem</code> table it is submitted to a workflow</li> <li>And if it is in the <code>workspaceitem</code> table it is in the pre-submitted state</li> -</ul></li> - -<li><p>The item seems to be in a pre-submitted state, so I tried to delete it from there:</p> - +</ul> +</li> +<li>The item seems to be in a pre-submitted state, so I tried to delete it from there:</li> +</ul> <pre><code>dspace=# DELETE FROM workspaceitem WHERE item_id=74648; DELETE 1 -</code></pre></li> - -<li><p>But after this I tried to delete the item from the XMLUI and it is <em>still</em> present&hellip;</p></li> +</code></pre><ul> +<li>But after this I tried to delete the item from the XMLUI and it is <em>still</em> present&hellip;</li> </ul> @@ -203,35 +184,30 @@ DELETE 1 Mon, 01 Apr 2019 09:00:43 +0300 https://alanorth.github.io/cgspace-notes/2019-04/ - <h2 id="2019-04-01">2019-04-01</h2> - + <h2 id="20190401">2019-04-01</h2> <ul> <li>Meeting with AgroKnow to discuss CGSpace, ILRI data, AReS, GARDIAN, etc - <ul> <li>They asked if we had plans to enable RDF support in CGSpace</li> -</ul></li> - -<li><p>There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today</p> - +</ul> +</li> +<li>There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today <ul> -<li><p>I suspected that some might not be successful, because the stats show less, but today they were all HTTP 200!</p> - +<li>I suspected that some might not be successful, because the stats show less, but today they were all HTTP 200!</li> +</ul> +</li> +</ul> <pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5 -4432 200 -</code></pre></li> -</ul></li> - -<li><p>In the last two 
weeks there have been 47,000 downloads of this <em>same exact PDF</em> by these three IP addresses</p></li> - -<li><p>Apply country and region corrections and deletions on DSpace Test and CGSpace:</p> - + 4432 200 +</code></pre><ul> +<li>In the last two weeks there have been 47,000 downloads of this <em>same exact PDF</em> by these three IP addresses</li> +<li>Apply country and region corrections and deletions on DSpace Test and CGSpace:</li> +</ul> <pre><code>$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d -</code></pre></li> -</ul> +</code></pre> @@ -240,20 +216,19 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace Fri, 01 Mar 2019 12:16:30 +0100 https://alanorth.github.io/cgspace-notes/2019-03/ - <h2 id="2019-03-01">2019-03-01</h2> - + <h2 id="20190301">2019-03-01</h2> <ul> -<li>I checked IITA&rsquo;s 259 Feb 14 records from last month for duplicates using Atmire&rsquo;s Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good</li> +<li>I checked IITA's 259 Feb 14 records from last month for duplicates using Atmire's Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good</li> <li>I am now only waiting to hear from her about where the items should go, though I assume Journal Articles go to IITA Journal Articles collection, etc&hellip;</li> -<li>Looking at the other half of Udana&rsquo;s WLE records from 2018-11 - +<li>Looking at the other half of Udana's WLE records from 2018-11 <ul> <li>I finished the ones for Restoring Degraded Landscapes (RDL), but these are for Variability, Risks and Competing Uses (VRC)</li> <li>I did the usual cleanups for whitespace, added regions where they made sense for certain countries, cleaned up the DOI link formats, added rights information based on the publications page for a few items</li> <li>Most worryingly, there are encoding errors in the abstracts for eleven items, for example:</li> <li>68.15% � 9.45 instead of 68.15% ± 9.45</li> <li>2003�2013 instead of 2003–2013</li> -</ul></li> +</ul> +</li> <li>I think I will need to ask Udana to re-copy and paste the abstracts with more care using Google Docs</li> </ul> @@ -264,40 +239,34 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace Fri, 01 Feb 2019 21:37:30 +0200 https://alanorth.github.io/cgspace-notes/2019-02/ - <h2 id="2019-02-01">2019-02-01</h2> - + <h2 id="20190201">2019-02-01</h2> <ul> <li>Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!</li> - -<li><p>The top IPs before, during, and after this latest alert tonight were:</p> - +<li>The top IPs before, during, and after this latest alert tonight were:</li> +</ul> <pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;01/Feb/2019:(17|18|19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 -245 207.46.13.5 -332 
54.70.40.11 -385 5.143.231.38 -405 207.46.13.173 -405 207.46.13.75 -1117 66.249.66.219 -1121 35.237.175.180 -1546 5.9.6.51 -2474 45.5.186.2 -5490 85.25.237.71 -</code></pre></li> - -<li><p><code>85.25.237.71</code> is the &ldquo;Linguee Bot&rdquo; that I first saw last month</p></li> - -<li><p>The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase</p></li> - -<li><p>There were just over 3 million accesses in the nginx logs last month:</p> - + 245 207.46.13.5 + 332 54.70.40.11 + 385 5.143.231.38 + 405 207.46.13.173 + 405 207.46.13.75 + 1117 66.249.66.219 + 1121 35.237.175.180 + 1546 5.9.6.51 + 2474 45.5.186.2 + 5490 85.25.237.71 +</code></pre><ul> +<li><code>85.25.237.71</code> is the &ldquo;Linguee Bot&rdquo; that I first saw last month</li> +<li>The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase</li> +<li>There were just over 3 million accesses in the nginx logs last month:</li> +</ul> <pre><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2019&quot; 3018243 real 0m19.873s user 0m22.203s sys 0m1.979s -</code></pre></li> -</ul> +</code></pre> @@ -306,26 +275,23 @@ sys 0m1.979s Wed, 02 Jan 2019 09:48:30 +0200 https://alanorth.github.io/cgspace-notes/2019-01/ - <h2 id="2019-01-02">2019-01-02</h2> - + <h2 id="20190102">2019-01-02</h2> <ul> <li>Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning</li> - -<li><p>I don&rsquo;t see anything interesting in the web server logs around that time though:</p> - +<li>I don't see anything interesting in the web server logs around that time though:</li> +</ul> <pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 - 92 40.77.167.4 - 99 210.7.29.100 -120 38.126.157.45 -177 35.237.175.180 -177 40.77.167.32 -216 66.249.75.219 -225 18.203.76.93 -261 46.101.86.248 -357 207.46.13.1 -903 54.70.40.11 -</code></pre></li> -</ul> + 92 40.77.167.4 + 99 210.7.29.100 + 120 38.126.157.45 + 177 35.237.175.180 + 177 40.77.167.32 + 216 66.249.75.219 + 225 18.203.76.93 + 261 46.101.86.248 + 357 207.46.13.1 + 903 54.70.40.11 +</code></pre> @@ -334,16 +300,13 @@ sys 0m1.979s Sun, 02 Dec 2018 02:09:30 +0200 https://alanorth.github.io/cgspace-notes/2018-12/ - <h2 id="2018-12-01">2018-12-01</h2> - + <h2 id="20181201">2018-12-01</h2> <ul> <li>Switch CGSpace (linode18) to use OpenJDK instead of Oracle JDK</li> <li>I manually installed OpenJDK, then removed Oracle JDK, then re-ran the <a href="http://github.com/ilri/rmg-ansible-public">Ansible playbook</a> to update all configuration files, etc</li> <li>Then I ran all system updates and restarted the server</li> </ul> - -<h2 id="2018-12-02">2018-12-02</h2> - +<h2 id="20181202">2018-12-02</h2> <ul> <li>I noticed that there is another issue with PDF thumbnails on CGSpace, and I see there was another <a href="https://usn.ubuntu.com/3831-1/">Ghostscript vulnerability last week</a></li> </ul> @@ -355,15 +318,12 @@ sys 0m1.979s Thu, 01 Nov 2018 16:41:30 +0200 https://alanorth.github.io/cgspace-notes/2018-11/ - <h2 id="2018-11-01">2018-11-01</h2> - + <h2 id="20181101">2018-11-01</h2> <ul> <li>Finalize AReS Phase I and Phase II ToRs</li> <li>Send a note about my <a href="https://github.com/ilri/dspace-statistics-api">dspace-statistics-api</a> to the dspace-tech mailing list</li> </ul> - -<h2 
id="2018-11-03">2018-11-03</h2> - +<h2 id="20181103">2018-11-03</h2> <ul> <li>Linode has been sending mails a few times a day recently that CGSpace (linode18) has had high CPU usage</li> <li>Today these are the top 10 IPs:</li> @@ -376,11 +336,10 @@ sys 0m1.979s Mon, 01 Oct 2018 22:31:54 +0300 https://alanorth.github.io/cgspace-notes/2018-10/ - <h2 id="2018-10-01">2018-10-01</h2> - + <h2 id="20181001">2018-10-01</h2> <ul> <li>Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items</li> -<li>I created a GitHub issue to track this <a href="https://github.com/ilri/DSpace/issues/389">#389</a>, because I&rsquo;m super busy in Nairobi right now</li> +<li>I created a GitHub issue to track this <a href="https://github.com/ilri/DSpace/issues/389">#389</a>, because I'm super busy in Nairobi right now</li> </ul> @@ -390,13 +349,12 @@ sys 0m1.979s Sun, 02 Sep 2018 09:55:54 +0300 https://alanorth.github.io/cgspace-notes/2018-09/ - <h2 id="2018-09-02">2018-09-02</h2> - + <h2 id="20180902">2018-09-02</h2> <ul> <li>New <a href="https://jdbc.postgresql.org/documentation/changelog.html#version_42.2.5">PostgreSQL JDBC driver version 42.2.5</a></li> -<li>I&rsquo;ll update the DSpace role in our <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure playbooks</a> and run the updated playbooks on CGSpace and DSpace Test</li> -<li>Also, I&rsquo;ll re-run the <code>postgresql</code> tasks because the custom PostgreSQL variables are dynamic according to the system&rsquo;s RAM, and we never re-ran them after migrating to larger Linodes last month</li> -<li>I&rsquo;m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I&rsquo;m getting those autowire errors in Tomcat 8.5.30 again:</li> +<li>I'll update the DSpace role in our <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure playbooks</a> and run the updated playbooks on CGSpace and DSpace Test</li> +<li>Also, I'll re-run the <code>postgresql</code> tasks because the custom PostgreSQL variables are dynamic according to the system's RAM, and we never re-ran them after migrating to larger Linodes last month</li> +<li>I'm testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I'm getting those autowire errors in Tomcat 8.5.30 again:</li> </ul> @@ -406,27 +364,20 @@ sys 0m1.979s Wed, 01 Aug 2018 11:52:54 +0300 https://alanorth.github.io/cgspace-notes/2018-08/ - <h2 id="2018-08-01">2018-08-01</h2> - + <h2 id="20180801">2018-08-01</h2> <ul> -<li><p>DSpace Test had crashed at some point yesterday morning and I see the following in <code>dmesg</code>:</p> - +<li>DSpace Test had crashed at some point yesterday morning and I see the following in <code>dmesg</code>:</li> +</ul> <pre><code>[Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child [Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB [Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB -</code></pre></li> - -<li><p>Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight</p></li> - -<li><p>From the DSpace log I see that eventually Solr stopped responding, so I guess the <code>java</code> process that was OOM killed above was Tomcat&rsquo;s</p></li> - -<li><p>I&rsquo;m not sure why Tomcat didn&rsquo;t crash with an OutOfMemoryError&hellip;</p></li> - -<li><p>Anyways, 
perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core</p></li> - -<li><p>The server only has 8GB of RAM so we&rsquo;ll eventually need to upgrade to a larger one because we&rsquo;ll start starving the OS, PostgreSQL, and command line batch processes</p></li> - -<li><p>I ran all system updates on DSpace Test and rebooted it</p></li> +</code></pre><ul> +<li>Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight</li> +<li>From the DSpace log I see that eventually Solr stopped responding, so I guess the <code>java</code> process that was OOM killed above was Tomcat's</li> +<li>I'm not sure why Tomcat didn't crash with an OutOfMemoryError&hellip;</li> +<li>Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core</li> +<li>The server only has 8GB of RAM so we'll eventually need to upgrade to a larger one because we'll start starving the OS, PostgreSQL, and command line batch processes</li> +<li>I ran all system updates on DSpace Test and rebooted it</li> </ul> @@ -436,19 +387,16 @@ sys 0m1.979s Sun, 01 Jul 2018 12:56:54 +0300 https://alanorth.github.io/cgspace-notes/2018-07/ - <h2 id="2018-07-01">2018-07-01</h2> - + <h2 id="20180701">2018-07-01</h2> <ul> -<li><p>I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:</p> - +<li>I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:</li> +</ul> <pre><code>$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace -</code></pre></li> - -<li><p>During the <code>mvn package</code> stage on the 5.8 branch I kept getting issues with java running out of memory:</p> - +</code></pre><ul> +<li>During the <code>mvn package</code> stage on the 5.8 branch I kept getting issues with java running out of memory:</li> +</ul> <pre><code>There is insufficient memory for the Java Runtime Environment to continue. 
-</code></pre></li> -</ul> +</code></pre> @@ -457,32 +405,27 @@ sys 0m1.979s Mon, 04 Jun 2018 19:49:54 -0700 https://alanorth.github.io/cgspace-notes/2018-06/ - <h2 id="2018-06-04">2018-06-04</h2> - + <h2 id="20180604">2018-06-04</h2> <ul> <li>Test the <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560">DSpace 5.8 module upgrades from Atmire</a> (<a href="https://github.com/ilri/DSpace/pull/378">#378</a>) - <ul> -<li>There seems to be a problem with the CUA and L&amp;R versions in <code>pom.xml</code> because they are using SNAPSHOT and it doesn&rsquo;t build</li> -</ul></li> +<li>There seems to be a problem with the CUA and L&amp;R versions in <code>pom.xml</code> because they are using SNAPSHOT and it doesn't build</li> +</ul> +</li> <li>I added the new CCAFS Phase II Project Tag <code>PII-FP1_PACCA2</code> and merged it into the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/379">#379</a>)</li> - -<li><p>I proofed and tested the ILRI author corrections that Peter sent back to me this week:</p> - +<li>I proofed and tested the ILRI author corrections that Peter sent back to me this week:</li> +</ul> <pre><code>$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n -</code></pre></li> - -<li><p>I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in <a href="https://alanorth.github.io/cgspace-notes/cgspace-notes/2018-03/">March, 2018</a></p></li> - -<li><p>Time to index ~70,000 items on CGSpace:</p> - +</code></pre><ul> +<li>I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in <a href="https://alanorth.github.io/cgspace-notes/cgspace-notes/2018-03/">March, 2018</a></li> +<li>Time to index ~70,000 items on CGSpace:</li> +</ul> <pre><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b real 74m42.646s user 8m5.056s sys 2m7.289s -</code></pre></li> -</ul> +</code></pre> @@ -491,15 +434,14 @@ sys 2m7.289s Tue, 01 May 2018 16:43:54 +0300 https://alanorth.github.io/cgspace-notes/2018-05/ - <h2 id="2018-05-01">2018-05-01</h2> - + <h2 id="20180501">2018-05-01</h2> <ul> <li>I cleared the Solr statistics core on DSpace Test by issuing two commands directly to the Solr admin interface: - <ul> -<li><a href="http://localhost:3000/solr/statistics/update?stream.body=%3Cdelete%3E%3Cquery%3E*:*%3C/query%3E%3C/delete%3E">http://localhost:3000/solr/statistics/update?stream.body=%3Cdelete%3E%3Cquery%3E*:*%3C/query%3E%3C/delete%3E</a></li> -<li><a href="http://localhost:3000/solr/statistics/update?stream.body=%3Ccommit/%3E">http://localhost:3000/solr/statistics/update?stream.body=%3Ccommit/%3E</a></li> -</ul></li> +<li>http://localhost:3000/solr/statistics/update?stream.body=%3Cdelete%3E%3Cquery%3E*:*%3C/query%3E%3C/delete%3E</li> +<li>http://localhost:3000/solr/statistics/update?stream.body=%3Ccommit/%3E</li> +</ul> +</li> <li>Then I reduced the JVM heap size from 6144 back to 5120m</li> <li>Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked the <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure scripts</a> to support hosts choosing which distribution they want to use</li> </ul> @@ -511,10 +453,9 @@ sys 2m7.289s Sun, 01 Apr 2018 16:13:54 +0200 https://alanorth.github.io/cgspace-notes/2018-04/ - 
<h2 id="2018-04-01">2018-04-01</h2> - + <h2 id="20180401">2018-04-01</h2> <ul> -<li>I tried to test something on DSpace Test but noticed that it&rsquo;s down since god knows when</li> +<li>I tried to test something on DSpace Test but noticed that it's down since god knows when</li> <li>Catalina logs at least show some memory errors yesterday:</li> </ul> @@ -525,8 +466,7 @@ sys 2m7.289s Fri, 02 Mar 2018 16:07:54 +0200 https://alanorth.github.io/cgspace-notes/2018-03/ - <h2 id="2018-03-02">2018-03-02</h2> - + <h2 id="20180302">2018-03-02</h2> <ul> <li>Export a CSV of the IITA community metadata for Martin Mueller</li> </ul> @@ -538,13 +478,12 @@ sys 2m7.289s Thu, 01 Feb 2018 16:28:54 +0200 https://alanorth.github.io/cgspace-notes/2018-02/ - <h2 id="2018-02-01">2018-02-01</h2> - + <h2 id="20180201">2018-02-01</h2> <ul> <li>Peter gave feedback on the <code>dc.rights</code> proof of concept that I had sent him last week</li> -<li>We don&rsquo;t need to distinguish between internal and external works, so that makes it just a simple list</li> +<li>We don't need to distinguish between internal and external works, so that makes it just a simple list</li> <li>Yesterday I figured out how to monitor DSpace sessions using JMX</li> -<li>I copied the logic in the <code>jmx_tomcat_dbpools</code> provided by Ubuntu&rsquo;s <code>munin-plugins-java</code> package and used the stuff I discovered about JMX <a href="https://alanorth.github.io/cgspace-notes/cgspace-notes/2018-01/">in 2018-01</a></li> +<li>I copied the logic in the <code>jmx_tomcat_dbpools</code> provided by Ubuntu's <code>munin-plugins-java</code> package and used the stuff I discovered about JMX <a href="https://alanorth.github.io/cgspace-notes/cgspace-notes/2018-01/">in 2018-01</a></li> </ul> @@ -554,33 +493,26 @@ sys 2m7.289s Tue, 02 Jan 2018 08:35:54 -0800 https://alanorth.github.io/cgspace-notes/2018-01/ - <h2 id="2018-01-02">2018-01-02</h2> - + <h2 id="20180102">2018-01-02</h2> <ul> <li>Uptime Robot noticed that CGSpace went down and up a few times last night, for a few minutes each time</li> -<li>I didn&rsquo;t get any load alerts from Linode and the REST and XMLUI logs don&rsquo;t show anything out of the ordinary</li> +<li>I didn't get any load alerts from Linode and the REST and XMLUI logs don't show anything out of the ordinary</li> <li>The nginx logs show HTTP 200s until <code>02/Jan/2018:11:27:17 +0000</code> when Uptime Robot got an HTTP 500</li> <li>In dspace.log around that time I see many errors like &ldquo;Client closed the connection before file download was complete&rdquo;</li> - -<li><p>And just before that I see this:</p> - +<li>And just before that I see this:</li> +</ul> <pre><code>Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000]. -</code></pre></li> - -<li><p>Ah hah! So the pool was actually empty!</p></li> - -<li><p>I need to increase that, let&rsquo;s try to bump it up from 50 to 75</p></li> - -<li><p>After that one client got an HTTP 499 but then the rest were HTTP 200, so I don&rsquo;t know what the hell Uptime Robot saw</p></li> - -<li><p>I notice this error quite a few times in dspace.log:</p> - +</code></pre><ul> +<li>Ah hah! 
So the pool was actually empty!</li> +<li>I need to increase that, let's try to bump it up from 50 to 75</li> +<li>After that one client got an HTTP 499 but then the rest were HTTP 200, so I don't know what the hell Uptime Robot saw</li> +<li>I notice this error quite a few times in dspace.log:</li> +</ul> <pre><code>2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered &quot; &quot;]&quot; &quot;] &quot;&quot; at line 1, column 32. -</code></pre></li> - -<li><p>And there are many of these errors every day for the past month:</p> - +</code></pre><ul> +<li>And there are many of these errors every day for the past month:</li> +</ul> <pre><code>$ grep -c &quot;Error while searching for sidebar facets&quot; dspace.log.* dspace.log.2017-11-21:4 dspace.log.2017-11-22:1 @@ -625,9 +557,8 @@ dspace.log.2017-12-30:89 dspace.log.2017-12-31:53 dspace.log.2018-01-01:45 dspace.log.2018-01-02:34 -</code></pre></li> - -<li><p>Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let&rsquo;s Encrypt if it&rsquo;s just a handful of domains</p></li> +</code></pre><ul> +<li>Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let's Encrypt if it's just a handful of domains</li> </ul> @@ -637,8 +568,7 @@ dspace.log.2018-01-02:34 Fri, 01 Dec 2017 13:53:54 +0300 https://alanorth.github.io/cgspace-notes/2017-12/ - <h2 id="2017-12-01">2017-12-01</h2> - + <h2 id="20171201">2017-12-01</h2> <ul> <li>Uptime Robot noticed that CGSpace went down</li> <li>The logs say &ldquo;Timeout waiting for idle object&rdquo;</li> @@ -653,27 +583,22 @@ dspace.log.2018-01-02:34 Thu, 02 Nov 2017 09:37:54 +0200 https://alanorth.github.io/cgspace-notes/2017-11/ - <h2 id="2017-11-01">2017-11-01</h2> - + <h2 id="20171101">2017-11-01</h2> <ul> <li>The CORE developers responded to say they are looking into their bot not respecting our robots.txt</li> </ul> - -<h2 id="2017-11-02">2017-11-02</h2> - +<h2 id="20171102">2017-11-02</h2> <ul> -<li><p>Today there have been no hits by CORE and no alerts from Linode (coincidence?)</p> - +<li>Today there have been no hits by CORE and no alerts from Linode (coincidence?)</li> +</ul> <pre><code># grep -c &quot;CORE&quot; /var/log/nginx/access.log 0 -</code></pre></li> - -<li><p>Generate list of authors on CGSpace for Peter to go through and correct:</p> - +</code></pre><ul> +<li>Generate list of authors on CGSpace for Peter to go through and correct:</li> +</ul> <pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv; COPY 54701 -</code></pre></li> -</ul> +</code></pre> @@ -682,17 +607,14 @@ COPY 54701 Sun, 01 Oct 2017 08:07:54 +0300 https://alanorth.github.io/cgspace-notes/2017-10/ - <h2 id="2017-10-01">2017-10-01</h2> - + <h2 id="20171001">2017-10-01</h2> <ul> -<li><p>Peter emailed to point out that many items in the <a href="https://cgspace.cgiar.org/handle/10568/2703">ILRI archive collection</a> have multiple handles:</p> - +<li>Peter emailed to point out that many items in the <a 
href="https://cgspace.cgiar.org/handle/10568/2703">ILRI archive collection</a> have multiple handles:</li> +</ul> <pre><code>http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336 -</code></pre></li> - -<li><p>There appears to be a pattern but I&rsquo;ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine</p></li> - -<li><p>Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections</p></li> +</code></pre><ul> +<li>There appears to be a pattern but I'll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine</li> +<li>Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections</li> </ul> @@ -711,16 +633,13 @@ COPY 54701 Thu, 07 Sep 2017 16:54:52 +0700 https://alanorth.github.io/cgspace-notes/2017-09/ - <h2 id="2017-09-06">2017-09-06</h2> - + <h2 id="20170906">2017-09-06</h2> <ul> <li>Linode sent an alert that CGSpace (linode18) was using 261% CPU for the past two hours</li> </ul> - -<h2 id="2017-09-07">2017-09-07</h2> - +<h2 id="20170907">2017-09-07</h2> <ul> -<li>Ask Sisay to clean up the WLE approvers a bit, as Marianne&rsquo;s user account is both in the approvers step as well as the group</li> +<li>Ask Sisay to clean up the WLE approvers a bit, as Marianne's user account is both in the approvers step as well as the group</li> </ul> @@ -730,22 +649,21 @@ COPY 54701 Tue, 01 Aug 2017 11:51:52 +0300 https://alanorth.github.io/cgspace-notes/2017-08/ - <h2 id="2017-08-01">2017-08-01</h2> - + <h2 id="20170801">2017-08-01</h2> <ul> <li>Linode sent an alert that CGSpace (linode18) was using 350% CPU for the past two hours</li> <li>I looked in the Activity pane of the Admin Control Panel and it seems that Google, Baidu, Yahoo, and Bing are all crawling with massive numbers of bots concurrently (~100 total, mostly Baidu and Google)</li> <li>The good thing is that, according to <code>dspace.log.2017-08-01</code>, they are all using the same Tomcat session</li> <li>This means our Tomcat Crawler Session Valve is working</li> <li>But many of the bots are browsing dynamic URLs like: - <ul> <li>/handle/10568/3353/discover</li> <li>/handle/10568/16510/browse</li> -</ul></li> +</ul> +</li> <li>The <code>robots.txt</code> only blocks the top-level <code>/discover</code> and <code>/browse</code> URLs&hellip; we will need to find a way to forbid them from accessing these!</li> <li>Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li> -<li>It turns out that we&rsquo;re already adding the <code>X-Robots-Tag &quot;none&quot;</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li> +<li>It turns out that we're already adding the <code>X-Robots-Tag &quot;none&quot;</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li> <li>Also, the bot has to successfully browse the page first so it can receive the HTTP header&hellip;</li> <li>We might actually have to <em>block</em> these requests with HTTP 403 depending on the user agent</li> <li>Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415</li> @@ -761,18 +679,15 @@ COPY 54701 Sat, 01 Jul 2017 18:03:52 +0300 https://alanorth.github.io/cgspace-notes/2017-07/ - <h2 
id="2017-07-01">2017-07-01</h2> - + <h2 id="20170701">2017-07-01</h2> <ul> <li>Run system updates and reboot DSpace Test</li> </ul> - -<h2 id="2017-07-04">2017-07-04</h2> - +<h2 id="20170704">2017-07-04</h2> <ul> <li>Merge changes for WLE Phase II theme rename (<a href="https://github.com/ilri/DSpace/pull/329">#329</a>)</li> -<li>Looking at extracting the metadata registries from ICARDA&rsquo;s MEL DSpace database so we can compare fields with CGSpace</li> -<li>We can use PostgreSQL&rsquo;s extended output format (<code>-x</code>) plus <code>sed</code> to format the output into quasi XML:</li> +<li>Looking at extracting the metadata registries from ICARDA's MEL DSpace database so we can compare fields with CGSpace</li> +<li>We can use PostgreSQL's extended output format (<code>-x</code>) plus <code>sed</code> to format the output into quasi XML:</li> </ul> @@ -782,7 +697,7 @@ COPY 54701 Thu, 01 Jun 2017 10:14:52 +0300 https://alanorth.github.io/cgspace-notes/2017-06/ - 2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we&rsquo;ll create a new sub-community for Phase II and create collections for the research themes there The current &ldquo;Research Themes&rdquo; community will be renamed to &ldquo;WLE Phase I Research Themes&rdquo; Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg. + 2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we'll create a new sub-community for Phase II and create collections for the research themes there The current &ldquo;Research Themes&rdquo; community will be renamed to &ldquo;WLE Phase I Research Themes&rdquo; Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg. @@ -791,7 +706,7 @@ COPY 54701 Mon, 01 May 2017 16:21:52 +0200 https://alanorth.github.io/cgspace-notes/2017-05/ - 2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it&rsquo;s a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire&rsquo;s CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace. 
+ 2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it's a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire's CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace. @@ -800,23 +715,18 @@ COPY 54701 Sun, 02 Apr 2017 17:08:52 +0200 https://alanorth.github.io/cgspace-notes/2017-04/ - <h2 id="2017-04-02">2017-04-02</h2> - + <h2 id="20170402">2017-04-02</h2> <ul> <li>Merge one change to CCAFS flagships that I had forgotten to remove last month (&ldquo;MANAGING CLIMATE RISK&rdquo;): <a href="https://github.com/ilri/DSpace/pull/317">https://github.com/ilri/DSpace/pull/317</a></li> <li>Quick proof-of-concept hack to add <code>dc.rights</code> to the input form, including some inline instructions/hints:</li> </ul> - -<p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2017/04/dc-rights.png" alt="dc.rights in the submission form" /></p> - +<p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2017/04/dc-rights.png" alt="dc.rights in the submission form"></p> <ul> <li>Remove redundant/duplicate text in the DSpace submission license</li> - -<li><p>Testing the CMYK patch on a collection with 650 items:</p> - +<li>Testing the CMYK patch on a collection with 650 items:</li> +</ul> <pre><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &quot;ImageMagick PDF Thumbnail&quot; -v &gt;&amp; /tmp/filter-media-cmyk.txt -</code></pre></li> -</ul> +</code></pre> @@ -825,14 +735,11 @@ COPY 54701 Wed, 01 Mar 2017 17:08:52 +0200 https://alanorth.github.io/cgspace-notes/2017-03/ - <h2 id="2017-03-01">2017-03-01</h2> - + <h2 id="20170301">2017-03-01</h2> <ul> <li>Run the 279 CIAT author corrections on CGSpace</li> </ul> - -<h2 id="2017-03-02">2017-03-02</h2> - +<h2 id="20170302">2017-03-02</h2> <ul> <li>Skype with Michael and Peter, discussing moving the CGIAR Library to CGSpace</li> <li>CGIAR people possibly open to moving content, redirecting library.cgiar.org to CGSpace and letting CGSpace resolve their handles</li> @@ -842,13 +749,11 @@ COPY 54701 <li>Also, need to consider talking to Atmire about hiring them to bring ORCiD metadata to REST / OAI</li> <li>Filed an issue on DSpace issue tracker for the <code>filter-media</code> bug that causes it to process JPGs even when limiting to the PDF thumbnail plugin: <a href="https://jira.duraspace.org/browse/DS-3516">DS-3516</a></li> <li>Discovered that the ImageMagic <code>filter-media</code> plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK</li> - -<li><p>Interestingly, it seems DSpace 4.x&rsquo;s thumbnails were sRGB, but forcing regeneration using DSpace 5.x&rsquo;s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see <a 
href="https://cgspace.cgiar.org/handle/10568/51999"><sup>10568</sup>&frasl;<sub>51999</sub></a>):</p> - +<li>Interestingly, it seems DSpace 4.x's thumbnails were sRGB, but forcing regeneration using DSpace 5.x's ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see <a href="https://cgspace.cgiar.org/handle/10568/51999">10568/51999</a>):</li> +</ul> <pre><code>$ identify ~/Desktop/alc_contrastes_desafios.jpg /Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000 -</code></pre></li> -</ul> +</code></pre> @@ -857,25 +762,22 @@ COPY 54701 Tue, 07 Feb 2017 07:04:52 -0800 https://alanorth.github.io/cgspace-notes/2017-02/ - <h2 id="2017-02-07">2017-02-07</h2> - + <h2 id="20170207">2017-02-07</h2> <ul> -<li><p>An item was mapped twice erroneously again, so I had to remove one of the mappings manually:</p> - +<li>An item was mapped twice erroneously again, so I had to remove one of the mappings manually:</li> +</ul> <pre><code>dspace=# select * from collection2item where item_id = '80278'; -id | collection_id | item_id + id | collection_id | item_id -------+---------------+--------- -92551 | 313 | 80278 -92550 | 313 | 80278 -90774 | 1051 | 80278 + 92551 | 313 | 80278 + 92550 | 313 | 80278 + 90774 | 1051 | 80278 (3 rows) dspace=# delete from collection2item where id = 92551 and item_id = 80278; DELETE 1 -</code></pre></li> - -<li><p>Create issue on GitHub to track the addition of CCAFS Phase II project tags (<a href="https://github.com/ilri/DSpace/issues/301">#301</a>)</p></li> - -<li><p>Looks like we&rsquo;ll be using <code>cg.identifier.ccafsprojectpii</code> as the field name</p></li> +</code></pre><ul> +<li>Create issue on GitHub to track the addition of CCAFS Phase II project tags (<a href="https://github.com/ilri/DSpace/issues/301">#301</a>)</li> +<li>Looks like we'll be using <code>cg.identifier.ccafsprojectpii</code> as the field name</li> </ul> @@ -885,12 +787,11 @@ DELETE 1 Mon, 02 Jan 2017 10:43:00 +0300 https://alanorth.github.io/cgspace-notes/2017-01/ - <h2 id="2017-01-02">2017-01-02</h2> - + <h2 id="20170102">2017-01-02</h2> <ul> <li>I checked to see if the Solr sharding task that is supposed to run on January 1st had run and saw there was an error</li> -<li>I tested on DSpace Test as well and it doesn&rsquo;t work there either</li> -<li>I asked on the dspace-tech mailing list because it seems to be broken, and actually now I&rsquo;m not sure if we&rsquo;ve ever had the sharding task run successfully over all these years</li> +<li>I tested on DSpace Test as well and it doesn't work there either</li> +<li>I asked on the dspace-tech mailing list because it seems to be broken, and actually now I'm not sure if we've ever had the sharding task run successfully over all these years</li> </ul> @@ -900,25 +801,20 @@ DELETE 1 Fri, 02 Dec 2016 10:43:00 +0300 https://alanorth.github.io/cgspace-notes/2016-12/ - <h2 id="2016-12-02">2016-12-02</h2> - + <h2 id="20161202">2016-12-02</h2> <ul> <li>CGSpace was down for five hours in the morning while I was sleeping</li> - -<li><p>While looking in the logs for errors, I see tons of warnings about Atmire MQM:</p> - +<li>While looking in the logs for errors, I see tons of warnings about Atmire MQM:</li> +</ul> <pre><code>2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), 
ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;) 2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail=&quot;dc.title&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;) 2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail=&quot;THUMBNAIL&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;) 2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail=&quot;-1&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;) 2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;) -</code></pre></li> - -<li><p>I see thousands of them in the logs for the last few months, so it&rsquo;s not related to the DSpace 5.5 upgrade</p></li> - -<li><p>I&rsquo;ve raised a ticket with Atmire to ask</p></li> - -<li><p>Another worrying error from dspace.log is:</p></li> +</code></pre><ul> +<li>I see thousands of them in the logs for the last few months, so it's not related to the DSpace 5.5 upgrade</li> +<li>I've raised a ticket with Atmire to ask</li> +<li>Another worrying error from dspace.log is:</li> </ul> @@ -928,13 +824,11 @@ DELETE 1 Tue, 01 Nov 2016 09:21:00 +0300 https://alanorth.github.io/cgspace-notes/2016-11/ - <h2 id="2016-11-01">2016-11-01</h2> - + <h2 id="20161101">2016-11-01</h2> <ul> -<li>Add <code>dc.type</code> to the output options for Atmire&rsquo;s Listings and Reports module (<a href="https://github.com/ilri/DSpace/pull/286">#286</a>)</li> +<li>Add <code>dc.type</code> to the output options for Atmire's Listings and Reports module (<a href="https://github.com/ilri/DSpace/pull/286">#286</a>)</li> </ul> - -<p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/11/listings-and-reports.png" alt="Listings and Reports with output type" /></p> +<p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/11/listings-and-reports.png" alt="Listings and Reports with output type"></p> @@ -943,22 +837,19 @@ DELETE 1 Mon, 03 Oct 2016 15:53:00 +0300 https://alanorth.github.io/cgspace-notes/2016-10/ - <h2 id="2016-10-03">2016-10-03</h2> - + <h2 id="20161003">2016-10-03</h2> <ul> <li>Testing adding <a 
href="https://wiki.duraspace.org/display/DSDOC5x/ORCID+Integration#ORCIDIntegration-EditingexistingitemsusingBatchCSVEditing">ORCIDs to a CSV</a> file for a single item to see if the author orders get messed up</li> <li>Need to test the following scenarios to see how author order is affected: - <ul> <li>ORCIDs only</li> <li>ORCIDs plus normal authors</li> -</ul></li> - -<li><p>I exported a random item&rsquo;s metadata as CSV, deleted <em>all columns</em> except id and collection, and made a new coloum called <code>ORCID:dc.contributor.author</code> with the following random ORCIDs from the ORCID registry:</p> - +</ul> +</li> +<li>I exported a random item's metadata as CSV, deleted <em>all columns</em> except id and collection, and made a new coloum called <code>ORCID:dc.contributor.author</code> with the following random ORCIDs from the ORCID registry:</li> +</ul> <pre><code>0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X -</code></pre></li> -</ul> +</code></pre> @@ -967,18 +858,15 @@ DELETE 1 Thu, 01 Sep 2016 15:53:00 +0300 https://alanorth.github.io/cgspace-notes/2016-09/ - <h2 id="2016-09-01">2016-09-01</h2> - + <h2 id="20160901">2016-09-01</h2> <ul> <li>Discuss helping CCAFS with some batch tagging of ORCID IDs for their authors</li> -<li>Discuss how the migration of CGIAR&rsquo;s Active Directory to a flat structure will break our LDAP groups in DSpace</li> +<li>Discuss how the migration of CGIAR's Active Directory to a flat structure will break our LDAP groups in DSpace</li> <li>We had been using <code>DC=ILRI</code> to determine whether a user was ILRI or not</li> - -<li><p>It looks like we might be able to use OUs now, instead of DCs:</p> - +<li>It looks like we might be able to use OUs now, instead of DCs:</li> +</ul> <pre><code>$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &quot;dc=cgiarad,dc=org&quot; -D &quot;admigration1@cgiarad.org&quot; -W &quot;(sAMAccountName=admigration1)&quot; -</code></pre></li> -</ul> +</code></pre> @@ -987,22 +875,19 @@ DELETE 1 Mon, 01 Aug 2016 15:53:00 +0300 https://alanorth.github.io/cgspace-notes/2016-08/ - <h2 id="2016-08-01">2016-08-01</h2> - + <h2 id="20160801">2016-08-01</h2> <ul> <li>Add updated distribution license from Sisay (<a href="https://github.com/ilri/DSpace/issues/259">#259</a>)</li> <li>Play with upgrading Mirage 2 dependencies in <code>bower.json</code> because most are several versions of out date</li> <li>Bootstrap is at 3.3.0 but upstream is at 3.3.7, and upgrading to anything beyond 3.3.1 breaks glyphicons and probably more</li> <li>bower stuff is a dead end, waste of time, too many issues</li> <li>Anything after Bootstrap 3.3.1 makes glyphicons disappear (HTTP 404 trying to access from incorrect path of <code>fonts</code>)</li> - -<li><p>Start working on DSpace 5.1 → 5.5 port:</p> - +<li>Start working on DSpace 5.1 → 5.5 port:</li> +</ul> <pre><code>$ git checkout -b 55new 5_x-prod $ git reset --hard ilri/5_x-prod $ git rebase -i dspace-5.5 -</code></pre></li> -</ul> +</code></pre> @@ -1011,22 +896,19 @@ $ git rebase -i dspace-5.5 Fri, 01 Jul 2016 10:53:00 +0300 https://alanorth.github.io/cgspace-notes/2016-07/ - <h2 id="2016-07-01">2016-07-01</h2> - + <h2 id="20160701">2016-07-01</h2> <ul> <li>Add <code>dc.description.sponsorship</code> to Discovery sidebar facets and make investors clickable in item view (<a href="https://github.com/ilri/DSpace/issues/232">#232</a>)</li> - -<li><p>I think this query should find and replace all authors that have &ldquo;,&rdquo; at the end of their names:</p> - 
+<li>I think this query should find and replace all authors that have &ldquo;,&rdquo; at the end of their names:</li> +</ul> <pre><code>dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$'; UPDATE 95 dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$'; -text_value + text_value ------------ (0 rows) -</code></pre></li> - -<li><p>In this case the select query was showing 95 results before the update</p></li> +</code></pre><ul> +<li>In this case the select query was showing 95 results before the update</li> </ul> @@ -1036,11 +918,10 @@ text_value Wed, 01 Jun 2016 10:53:00 +0300 https://alanorth.github.io/cgspace-notes/2016-06/ - <h2 id="2016-06-01">2016-06-01</h2> - + <h2 id="20160601">2016-06-01</h2> <ul> <li>Experimenting with IFPRI OAI (we want to harvest their publications)</li> -<li>After reading the <a href="https://www.oclc.org/support/services/contentdm/help/server-admin-help/oai-support.en.html">ContentDM documentation</a> I found IFPRI&rsquo;s OAI endpoint: <a href="http://ebrary.ifpri.org/oai/oai.php">http://ebrary.ifpri.org/oai/oai.php</a></li> +<li>After reading the <a href="https://www.oclc.org/support/services/contentdm/help/server-admin-help/oai-support.en.html">ContentDM documentation</a> I found IFPRI's OAI endpoint: <a href="http://ebrary.ifpri.org/oai/oai.php">http://ebrary.ifpri.org/oai/oai.php</a></li> <li>After reading the <a href="https://www.openarchives.org/OAI/openarchivesprotocol.html">OAI documentation</a> and testing with an <a href="http://validator.oaipmh.com/">OAI validator</a> I found out how to get their publications</li> <li>This is their publications set: <a href="http://ebrary.ifpri.org/oai/oai.php?verb=ListRecords&amp;from=2016-01-01&amp;set=p15738coll2&amp;metadataPrefix=oai_dc">http://ebrary.ifpri.org/oai/oai.php?verb=ListRecords&amp;from=2016-01-01&amp;set=p15738coll2&amp;metadataPrefix=oai_dc</a></li> <li>You can see the others by using the OAI <code>ListSets</code> verb: <a href="http://ebrary.ifpri.org/oai/oai.php?verb=ListSets">http://ebrary.ifpri.org/oai/oai.php?verb=ListSets</a></li> @@ -1054,18 +935,15 @@ text_value Sun, 01 May 2016 23:06:00 +0300 https://alanorth.github.io/cgspace-notes/2016-05/ - <h2 id="2016-05-01">2016-05-01</h2> - + <h2 id="20160501">2016-05-01</h2> <ul> <li>Since yesterday there have been 10,000 REST errors and the site has been unstable again</li> <li>I have blocked access to the API now</li> - -<li><p>There are 3,000 IPs accessing the REST API in a 24-hour period!</p> - +<li>There are 3,000 IPs accessing the REST API in a 24-hour period!</li> +</ul> <pre><code># awk '{print $1}' /var/log/nginx/rest.log | uniq | wc -l 3168 -</code></pre></li> -</ul> +</code></pre> @@ -1074,13 +952,12 @@ text_value Mon, 04 Apr 2016 11:06:00 +0300 https://alanorth.github.io/cgspace-notes/2016-04/ - <h2 id="2016-04-04">2016-04-04</h2> - + <h2 id="20160404">2016-04-04</h2> <ul> <li>Looking at log file use on CGSpace and notice that we need to work on our cron setup a bit</li> <li>We are backing up all logs in the log folder, including useless stuff like solr, cocoon, handle-plugin, etc</li> -<li>After running DSpace for over five years I&rsquo;ve never needed to look in any other log file than dspace.log, leave alone one from last year!</li> -<li>This will save us a few gigs of backup space we&rsquo;re paying for on S3</li> +<li>After running DSpace for 
over five years I've never needed to look in any other log file than dspace.log, leave alone one from last year!</li> +<li>This will save us a few gigs of backup space we're paying for on S3</li> <li>Also, I noticed the <code>checker</code> log has some errors we should pay attention to:</li> </ul> @@ -1091,11 +968,10 @@ text_value Wed, 02 Mar 2016 16:50:00 +0300 https://alanorth.github.io/cgspace-notes/2016-03/ - <h2 id="2016-03-02">2016-03-02</h2> - + <h2 id="20160302">2016-03-02</h2> <ul> <li>Looking at issues with author authorities on CGSpace</li> -<li>For some reason we still have the <code>index-lucene-update</code> cron job active on CGSpace, but I&rsquo;m pretty sure we don&rsquo;t need it as of the latest few versions of Atmire&rsquo;s Listings and Reports module</li> +<li>For some reason we still have the <code>index-lucene-update</code> cron job active on CGSpace, but I'm pretty sure we don't need it as of the latest few versions of Atmire's Listings and Reports module</li> <li>Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match environment on CGSpace server</li> </ul> @@ -1106,16 +982,13 @@ text_value Fri, 05 Feb 2016 13:18:00 +0300 https://alanorth.github.io/cgspace-notes/2016-02/ - <h2 id="2016-02-05">2016-02-05</h2> - + <h2 id="20160205">2016-02-05</h2> <ul> <li>Looking at some DAGRIS data for Abenet Yabowork</li> <li>Lots of issues with spaces, newlines, etc causing the import to fail</li> <li>I noticed we have a very <em>interesting</em> list of countries on CGSpace:</li> </ul> - -<p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/02/cgspace-countries.png" alt="CGSpace country list" /></p> - +<p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/02/cgspace-countries.png" alt="CGSpace country list"></p> <ul> <li>Not only are there 49,000 countries, we have some blanks (25)&hellip;</li> <li>Also, lots of things like &ldquo;COTE D`LVOIRE&rdquo; and &ldquo;COTE D IVOIRE&rdquo;</li> @@ -1128,8 +1001,7 @@ text_value Wed, 13 Jan 2016 13:18:00 +0300 https://alanorth.github.io/cgspace-notes/2016-01/ - <h2 id="2016-01-13">2016-01-13</h2> - + <h2 id="20160113">2016-01-13</h2> <ul> <li>Move ILRI collection <code>10568/12503</code> from <code>10568/27869</code> to <code>10568/27629</code> using the <a href="https://gist.github.com/alanorth/392c4660e8b022d99dfa">move_collections.sh</a> script I wrote last year.</li> <li>I realized it is only necessary to clear the Cocoon cache after moving collections—rather than reindexing—as no metadata has changed, and therefore no search or browse indexes need to be updated.</li> @@ -1143,18 +1015,16 @@ text_value Wed, 02 Dec 2015 13:18:00 +0300 https://alanorth.github.io/cgspace-notes/2015-12/ - <h2 id="2015-12-02">2015-12-02</h2> - + <h2 id="20151202">2015-12-02</h2> <ul> -<li><p>Replace <code>lzop</code> with <code>xz</code> in log compression cron jobs on DSpace Test—it uses less space:</p> - +<li>Replace <code>lzop</code> with <code>xz</code> in log compression cron jobs on DSpace Test—it uses less space:</li> +</ul> <pre><code># cd /home/dspacetest.cgiar.org/log # ls -lh dspace.log.2015-11-18* -rw-rw-r-- 1 tomcat7 tomcat7 2.0M Nov 18 23:59 dspace.log.2015-11-18 -rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo -rw-rw-r-- 1 tomcat7 tomcat7 169K Nov 18 23:59 dspace.log.2015-11-18.xz -</code></pre></li> -</ul> +</code></pre> @@ -1163,18 +1033,15 @@ text_value Mon, 23 Nov 2015 17:00:57 +0300 
https://alanorth.github.io/cgspace-notes/2015-11/ - <h2 id="2015-11-22">2015-11-22</h2> - + <h2 id="20151122">2015-11-22</h2> <ul> <li>CGSpace went down</li> <li>Looks like DSpace exhausted its PostgreSQL connection pool</li> - -<li><p>Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:</p> - +<li>Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:</li> +</ul> <pre><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace 78 -</code></pre></li> -</ul> +</code></pre> diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html index 78e3d9bd5..df99907a7 100644 --- a/docs/posts/page/2/index.html +++ b/docs/posts/page/2/index.html @@ -9,13 +9,12 @@ - - + @@ -100,40 +99,34 @@

    -

2019-02-01

• Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!
• The top IPs before, during, and after this latest alert tonight were:

# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "01/Feb/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    245 207.46.13.5
    332 54.70.40.11
    385 5.143.231.38
    405 207.46.13.173
    405 207.46.13.75
   1117 66.249.66.219
   1121 35.237.175.180
   1546 5.9.6.51
   2474 45.5.186.2
   5490 85.25.237.71

• 85.25.237.71 is the “Linguee Bot” that I first saw last month
• The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase
• There were just over 3 million accesses in the nginx logs last month:

# time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
3018243

real    0m19.873s
user    0m22.203s
sys     0m1.979s
    -
Read more →

    -

2019-01-02

• Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning
• I don't see anything interesting in the web server logs around that time though:

# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
     92 40.77.167.4
     99 210.7.29.100
    120 38.126.157.45
    177 35.237.175.180
    177 40.77.167.32
    216 66.249.75.219
    225 18.203.76.93
    261 46.101.86.248
    357 207.46.13.1
    903 54.70.40.11
    Read more → @@ -188,16 +178,13 @@ sys 0m1.979s

    -

2018-12-01

• Switch CGSpace (linode18) to use OpenJDK instead of Oracle JDK
• I manually installed OpenJDK, then removed Oracle JDK, then re-ran the Ansible playbook to update all configuration files, etc
• Then I ran all system updates and restarted the server
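For the record, the manual part of that switch on Ubuntu is roughly the following (a sketch only; the Oracle package name is an assumption since it came from a third-party PPA):

# apt install openjdk-8-jdk-headless
# update-alternatives --config java        # select the OpenJDK build
# apt purge oracle-java8-installer         # assumed name of the old Oracle JDK package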

    2018-12-02

    @@ -218,15 +205,12 @@ sys 0m1.979s

    -

2018-11-01

• Finalize AReS Phase I and Phase II ToRs
• Send a note about my dspace-statistics-api to the dspace-tech mailing list

2018-11-03

• Linode has been sending mails a few times a day recently that CGSpace (linode18) has had high CPU usage
• Today these are the top 10 IPs:
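The usual one-liner for pulling the top ten IPs out of the nginx logs, as a sketch (the date pattern is just an example for that day):

# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Nov/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10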
    • @@ -248,11 +232,10 @@ sys 0m1.979s

      -

2018-10-01

• Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items
• I created a GitHub issue to track this #389, because I'm super busy in Nairobi right now
      Read more → @@ -271,13 +254,12 @@ sys 0m1.979s

      -

2018-09-02

• New PostgreSQL JDBC driver version 42.2.5
• I'll update the DSpace role in our Ansible infrastructure playbooks and run the updated playbooks on CGSpace and DSpace Test
• Also, I'll re-run the postgresql tasks because the custom PostgreSQL variables are dynamic according to the system's RAM, and we never re-ran them after migrating to larger Linodes last month
• I'm testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I'm getting those autowire errors in Tomcat 8.5.30 again:
      Read more → @@ -296,27 +278,20 @@ sys 0m1.979s

      -

2018-08-01

• DSpace Test had crashed at some point yesterday morning and I see the following in dmesg:

[Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
[Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB
[Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

• Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight
• From the DSpace log I see that eventually Solr stopped responding, so I guess the java process that was OOM killed above was Tomcat's
• I'm not sure why Tomcat didn't crash with an OutOfMemoryError…
• Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core
• The server only has 8GB of RAM so we'll eventually need to upgrade to a larger one because we'll start starving the OS, PostgreSQL, and command line batch processes
• I ran all system updates on DSpace Test and rebooted it
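If I do bump the heap, the change is just the -Xms/-Xmx pair in Tomcat's JAVA_OPTS; a minimal sketch, assuming the usual /etc/default/tomcat7 on this host (the path and the other flags are from memory, not copied from the server):

# /etc/default/tomcat7 (sketch)
JAVA_OPTS="-Djava.awt.headless=true -Xms6144m -Xmx6144m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8"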
      Read more → @@ -335,19 +310,16 @@ sys 0m1.979s

      -

2018-07-01

• I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:

$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace

• During the mvn package stage on the 5.8 branch I kept getting issues with java running out of memory:

There is insufficient memory for the Java Runtime Environment to continue.
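That error usually means the JVM could not allocate native memory from the OS (rather than hitting its own heap limit), so the workaround is either to cap the build's heap below the free RAM or to add temporary swap; a rough sketch with arbitrary sizes:

$ export MAVEN_OPTS="-Xmx512m"    # keep the build's heap within the available RAM
# fallocate -l 2G /swapfile && chmod 600 /swapfile && mkswap /swapfile && swapon /swapfile    # or add temporary swap as root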
      +
      Read more → @@ -365,32 +337,27 @@ sys 0m1.979s

      -

2018-06-04

• Test the DSpace 5.8 module upgrades from Atmire (#378)
  • There seems to be a problem with the CUA and L&R versions in pom.xml because they are using SNAPSHOT and it doesn't build
• I added the new CCAFS Phase II Project Tag PII-FP1_PACCA2 and merged it into the 5_x-prod branch (#379)
• I proofed and tested the ILRI author corrections that Peter sent back to me this week:

$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n

• I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in March, 2018
• Time to index ~70,000 items on CGSpace:

$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b

real    74m42.646s
user    8m5.056s
sys     2m7.289s
    -
Read more →

    -

2018-05-01

• Then I reduced the JVM heap size from 6144 back to 5120m
• Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked the Ansible infrastructure scripts to support hosts choosing which distribution they want to use
  • diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html index 0659a1ac3..bba27dbbd 100644 --- a/docs/posts/page/3/index.html +++ b/docs/posts/page/3/index.html @@ -9,13 +9,12 @@ - - + @@ -100,10 +99,9 @@

    -

2018-04-01

• I tried to test something on DSpace Test but noticed that it's down since god knows when
• Catalina logs at least show some memory errors yesterday:
    Read more → @@ -123,8 +121,7 @@

    -

2018-03-02

• Export a CSV of the IITA community metadata for Martin Mueller
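For reference, this is just the stock DSpace metadata export CLI; a sketch, with a placeholder handle for the IITA community:

$ [dspace]/bin/dspace metadata-export -i 10568/XXXXX -f /tmp/iita.csv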
    @@ -145,13 +142,12 @@

    -

2018-02-01

• Peter gave feedback on the dc.rights proof of concept that I had sent him last week
• We don't need to distinguish between internal and external works, so that makes it just a simple list
• Yesterday I figured out how to monitor DSpace sessions using JMX
• I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu's munin-plugins-java package and used the stuff I discovered about JMX in 2018-01
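As a reminder of the general shape, exposing a local JMX port on Tomcat looks roughly like this (the port and the no-auth settings are illustrative only, not what should run in production):

CATALINA_OPTS="$CATALINA_OPTS -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=9010 \
  -Dcom.sun.management.jmxremote.local.only=true \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false"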
    Read more → @@ -170,33 +166,26 @@

    -

2018-01-02

• Uptime Robot noticed that CGSpace went down and up a few times last night, for a few minutes each time
• I didn't get any load alerts from Linode and the REST and XMLUI logs don't show anything out of the ordinary
• The nginx logs show HTTP 200s until 02/Jan/2018:11:27:17 +0000 when Uptime Robot got an HTTP 500
• In dspace.log around that time I see many errors like “Client closed the connection before file download was complete”
• And just before that I see this:

Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].

• Ah hah! So the pool was actually empty!
• I need to increase that, let's try to bump it up from 50 to 75
• After that one client got an HTTP 499 but then the rest were HTTP 200, so I don't know what the hell Uptime Robot saw
• I notice this error quite a few times in dspace.log:

2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered " "]" "] "" at line 1, column 32.

• And there are many of these errors every day for the past month:

$ grep -c "Error while searching for sidebar facets" dspace.log.*
dspace.log.2017-11-21:4
dspace.log.2017-11-22:1
dspace.log.2017-12-30:89
dspace.log.2017-12-31:53
dspace.log.2018-01-01:45
dspace.log.2018-01-02:34

• Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let's Encrypt if it's just a handful of domains
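A sketch of the Let's Encrypt route with certbot's standalone authenticator (example domains only; keeping a wildcard would need the DNS-01 challenge instead):

# certbot certonly --standalone -d ilri.org -d www.ilri.org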
    Read more → @@ -262,8 +250,7 @@ dspace.log.2018-01-02:34

    -

2017-12-01

• Uptime Robot noticed that CGSpace went down
• The logs say “Timeout waiting for idle object”
    • @@ -287,27 +274,22 @@ dspace.log.2018-01-02:34

      -

2017-11-01

• The CORE developers responded to say they are looking into their bot not respecting our robots.txt

2017-11-02

• Today there have been no hits by CORE and no alerts from Linode (coincidence?)

# grep -c "CORE" /var/log/nginx/access.log
0

• Generate list of authors on CGSpace for Peter to go through and correct:

dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
COPY 54701
      -
    • -
    + Read more → @@ -325,17 +307,14 @@ COPY 54701

    -

2017-10-01

http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336

• There appears to be a pattern but I'll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
• Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
    Read more → @@ -374,16 +353,13 @@ COPY 54701

    -

2017-09-06

• Linode sent an alert that CGSpace (linode18) was using 261% CPU for the past two hours

2017-09-07

• Ask Sisay to clean up the WLE approvers a bit, as Marianne's user account is both in the approvers step as well as the group
    Read more → @@ -402,22 +378,21 @@ COPY 54701

    -

2017-08-01

• Linode sent an alert that CGSpace (linode18) was using 350% CPU for the past two hours
• I looked in the Activity pane of the Admin Control Panel and it seems that Google, Baidu, Yahoo, and Bing are all crawling with massive numbers of bots concurrently (~100 total, mostly Baidu and Google)
• The good thing is that, according to dspace.log.2017-08-01, they are all using the same Tomcat session
• This means our Tomcat Crawler Session Valve is working
• But many of the bots are browsing dynamic URLs like:
  • /handle/10568/3353/discover
  • /handle/10568/16510/browse
• The robots.txt only blocks the top-level /discover and /browse URLs… we will need to find a way to forbid them from accessing these!
• Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
• It turns out that we're already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
• Also, the bot has to successfully browse the page first so it can receive the HTTP header…
• We might actually have to block these requests with HTTP 403 depending on the user agent
  • diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html index a4eadce4e..dbb8e125f 100644 --- a/docs/posts/page/4/index.html +++ b/docs/posts/page/4/index.html @@ -9,13 +9,12 @@ - - + @@ -100,18 +99,15 @@

    -

2017-07-01

• Run system updates and reboot DSpace Test

2017-07-04

• Merge changes for WLE Phase II theme rename (#329)
• Looking at extracting the metadata registries from ICARDA's MEL DSpace database so we can compare fields with CGSpace
• We can use PostgreSQL's extended output format (-x) plus sed to format the output into quasi XML:
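The general shape of that trick is something like the following (a sketch only; the database name is a placeholder and the sed expression is illustrative, since the real one needs more massaging of the key/value lines):

$ psql -x mel -c 'SELECT * FROM metadatafieldregistry;' | sed -e 's/^-\[ RECORD.*/<record>/' > /tmp/mel-registry.txt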
    Read more → @@ -130,7 +126,7 @@

    - 2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we’ll create a new sub-community for Phase II and create collections for the research themes there The current “Research Themes” community will be renamed to “WLE Phase I Research Themes” Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg. + 2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we'll create a new sub-community for Phase II and create collections for the research themes there The current “Research Themes” community will be renamed to “WLE Phase I Research Themes” Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg. Read more → @@ -148,7 +144,7 @@

    - 2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it’s a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire’s CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace. + 2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it's a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire's CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace. Read more → @@ -166,23 +162,18 @@

    -

2017-04-02

• Merge one change to CCAFS flagships that I had forgotten to remove last month (“MANAGING CLIMATE RISK”): https://github.com/ilri/DSpace/pull/317
• Quick proof-of-concept hack to add dc.rights to the input form, including some inline instructions/hints:

dc.rights in the submission form

• Remove redundant/duplicate text in the DSpace submission license
• Testing the CMYK patch on a collection with 650 items:

$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt
    +
    Read more → @@ -200,14 +191,11 @@

    -

2017-03-01

• Run the 279 CIAT author corrections on CGSpace

2017-03-02

• Skype with Michael and Peter, discussing moving the CGIAR Library to CGSpace
• CGIAR people possibly open to moving content, redirecting library.cgiar.org to CGSpace and letting CGSpace resolve their handles
• Also, need to consider talking to Atmire about hiring them to bring ORCiD metadata to REST / OAI
• Filed an issue on DSpace issue tracker for the filter-media bug that causes it to process JPGs even when limiting to the PDF thumbnail plugin: DS-3516
• Discovered that the ImageMagick filter-media plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK
• Interestingly, it seems DSpace 4.x's thumbnails were sRGB, but forcing regeneration using DSpace 5.x's ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 10568/51999):

$ identify ~/Desktop/alc_contrastes_desafios.jpg
/Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
    -
    - + Read more → @@ -241,25 +227,22 @@

    -

2017-02-07

• An item was mapped twice erroneously again, so I had to remove one of the mappings manually:

dspace=# select * from collection2item where item_id = '80278';
  id   | collection_id | item_id
-------+---------------+---------
 92551 |           313 |   80278
 92550 |           313 |   80278
 90774 |          1051 |   80278
(3 rows)
dspace=# delete from collection2item where id = 92551 and item_id = 80278;
DELETE 1

• Create issue on GitHub to track the addition of CCAFS Phase II project tags (#301)
• Looks like we'll be using cg.identifier.ccafsprojectpii as the field name
    Read more → @@ -278,12 +261,11 @@ DELETE 1

    -

2017-01-02

• I checked to see if the Solr sharding task that is supposed to run on January 1st had run and saw there was an error
• I tested on DSpace Test as well and it doesn't work there either
• I asked on the dspace-tech mailing list because it seems to be broken, and actually now I'm not sure if we've ever had the sharding task run successfully over all these years
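For context, the sharding is the yearly stats-util job from cron; running it by hand is just the stock DSpace CLI (the failure above is in the task itself, not the invocation):

$ [dspace]/bin/dspace stats-util -s    # split the Solr statistics core into yearly shards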
    Read more → @@ -302,25 +284,20 @@ DELETE 1

    -

2016-12-02

• CGSpace was down for five hours in the morning while I was sleeping
• While looking in the logs for errors, I see tons of warnings about Atmire MQM:
    2016-12-02 03:00:32,352 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail="dc.title", transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail="THUMBNAIL", transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
    -
    - -
• I see thousands of them in the logs for the last few months, so it's not related to the DSpace 5.5 upgrade
• I've raised a ticket with Atmire to ask
• Another worrying error from dspace.log is:
    Read more → @@ -339,13 +316,11 @@ DELETE 1

    -

2016-11-01

• Add dc.type to the output options for Atmire's Listings and Reports module (#286)

Listings and Reports with output type

    Read more → @@ -363,22 +338,19 @@ DELETE 1

    -

2016-10-03

• Testing adding ORCIDs to a CSV file for a single item to see if the author orders get messed up
• Need to test the following scenarios to see how author order is affected:
  • ORCIDs only
  • ORCIDs plus normal authors
• I exported a random item's metadata as CSV, deleted all columns except id and collection, and made a new column called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:

0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
    +
    Read more → diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html index 99c8b0b0d..837ceb026 100644 --- a/docs/posts/page/5/index.html +++ b/docs/posts/page/5/index.html @@ -9,13 +9,12 @@ - - + @@ -100,18 +99,15 @@

    -

2016-09-01

• Discuss helping CCAFS with some batch tagging of ORCID IDs for their authors
• Discuss how the migration of CGIAR's Active Directory to a flat structure will break our LDAP groups in DSpace
• We had been using DC=ILRI to determine whether a user was ILRI or not
• It looks like we might be able to use OUs now, instead of DCs:

$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=org" -D "admigration1@cgiarad.org" -W "(sAMAccountName=admigration1)"
    +
    Read more → @@ -129,22 +125,19 @@

    -

2016-08-01

• Add updated distribution license from Sisay (#259)
• Play with upgrading Mirage 2 dependencies in bower.json because most are several versions out of date
• Bootstrap is at 3.3.0 but upstream is at 3.3.7, and upgrading to anything beyond 3.3.1 breaks glyphicons and probably more
• bower stuff is a dead end, waste of time, too many issues
• Anything after Bootstrap 3.3.1 makes glyphicons disappear (HTTP 404 trying to access from incorrect path of fonts)
• Start working on DSpace 5.1 → 5.5 port:

$ git checkout -b 55new 5_x-prod
$ git reset --hard ilri/5_x-prod
$ git rebase -i dspace-5.5
    -
    - + Read more → @@ -162,22 +155,19 @@ $ git rebase -i dspace-5.5

    -

2016-07-01

• Add dc.description.sponsorship to Discovery sidebar facets and make investors clickable in item view (#232)
• I think this query should find and replace all authors that have “,” at the end of their names:

dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
UPDATE 95
dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
 text_value
------------
(0 rows)

• In this case the select query was showing 95 results before the update
    Read more → @@ -196,11 +186,10 @@ text_value

    -

2016-06-01

    + Read more → @@ -252,13 +238,12 @@ text_value

    -

2016-04-04

• Looking at log file use on CGSpace and notice that we need to work on our cron setup a bit
• We are backing up all logs in the log folder, including useless stuff like solr, cocoon, handle-plugin, etc
• After running DSpace for over five years I've never needed to look in any other log file than dspace.log, let alone one from last year!
• This will save us a few gigs of backup space we're paying for on S3
• Also, I noticed the checker log has some errors we should pay attention to:
    Read more → @@ -278,11 +263,10 @@ text_value

    -

2016-03-02

• Looking at issues with author authorities on CGSpace
• For some reason we still have the index-lucene-update cron job active on CGSpace, but I'm pretty sure we don't need it as of the latest few versions of Atmire's Listings and Reports module
• Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match environment on CGSpace server
    Read more → @@ -302,16 +286,13 @@ text_value

    -

2016-02-05

• Looking at some DAGRIS data for Abenet Yabowork
• Lots of issues with spaces, newlines, etc causing the import to fail
• I noticed we have a very interesting list of countries on CGSpace:

CGSpace country list

• Not only are there 49,000 countries, we have some blanks (25)…
• Also, lots of things like “COTE D`LVOIRE” and “COTE D IVOIRE”
    • @@ -333,8 +314,7 @@ text_value

      -

2016-01-13

• Move ILRI collection 10568/12503 from 10568/27869 to 10568/27629 using the move_collections.sh script I wrote last year.
• I realized it is only necessary to clear the Cocoon cache after moving collections—rather than reindexing—as no metadata has changed, and therefore no search or browse indexes need to be updated.
      • @@ -357,18 +337,16 @@ text_value

        -

2015-12-02

• Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less space:

# cd /home/dspacetest.cgiar.org/log
# ls -lh dspace.log.2015-11-18*
-rw-rw-r-- 1 tomcat7 tomcat7 2.0M Nov 18 23:59 dspace.log.2015-11-18
-rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo
-rw-rw-r-- 1 tomcat7 tomcat7 169K Nov 18 23:59 dspace.log.2015-11-18.xz
        -
        -
      + Read more → diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html index 7a93e7271..e551e3882 100644 --- a/docs/posts/page/6/index.html +++ b/docs/posts/page/6/index.html @@ -9,13 +9,12 @@ - - + @@ -100,18 +99,15 @@

    2015-11-22

    • CGSpace went down
    • Looks like DSpace exhausted its PostgreSQL connection pool
    • Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:

    $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
    78
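
    • A sketch of the follow-up I'd run to see where the idle connections come from, assuming PostgreSQL 9.2+ where pg_stat_activity has a state column:

    # count connections per user and state
    psql -c "SELECT usename, state, count(*) FROM pg_stat_activity GROUP BY usename, state ORDER BY count(*) DESC;"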
    diff --git a/docs/sitemap.xml b/docs/sitemap.xml

    • The lastmod dates for the sitemap entries were bumped from 2019-11-26T15:53:57+02:00 to 2019-11-27T14:56:00+02:00

    diff --git a/docs/tags/index.html b/docs/tags/index.html

    2019-11-04

    • Peter noticed that there were 5.2 million hits on CGSpace in 2019-10 according to the Atmire usage statistics
    • I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 million in the API logs:

    # zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
    4671942
    # zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
    1277694

    • So 4.6 million from XMLUI and another 1.2 million from API requests
    • Let's see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):

    # zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
    1183456
    # zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
    106781
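
    • To break the API hits down further, something like this should tally the October REST requests by top-level endpoint (a sketch; it assumes the request paths in rest.log start with /rest/):

    # tally October REST API requests by endpoint
    zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" \
      | grep -oE '/rest/[a-z]+' | sort | uniq -c | sort -rn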

    Possible changes to CGSpace metadata fields to align more with DC, QDC, and DCTERMS as well as CG Core v2.

    With reference to CG Core v2 draft standard by Marie-Angélique as well as DCMI DCTERMS.

    2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace - I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data - I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script's “unneccesary Unicode” fix: $ csvcut -c 'id,dc.

    2019-09-01

    • Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning
    • Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:

    # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
        440 17.58.101.255
        441 157.55.39.101
        485 207.46.13.43
        728 169.60.128.125
        730 207.46.13.108
        758 157.55.39.9
        808 66.160.140.179
        814 207.46.13.212
       2472 163.172.71.23
       6092 3.94.211.189
    # zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         33 2a01:7e00::f03c:91ff:fe16:fcb
         57 3.83.192.124
         57 3.87.77.25
         57 54.82.1.8
        822 2a01:9cc0:47:1:1a:4:0:2
       1223 45.5.184.72
       1633 172.104.229.92
       5112 205.186.128.185
       7249 2a01:7e00::f03c:91ff:fe18:7396
       9124 45.5.186.2
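
    • A quick, non-authoritative check for which of those top XMLUI IPs are known crawlers is to reverse-resolve them (IPs copied from the output above):

    # reverse-resolve the busiest XMLUI IPs
    for ip in 3.94.211.189 163.172.71.23 207.46.13.212 66.160.140.179 157.55.39.9; do
        echo -n "$ip -> "
        dig +short -x "$ip"
    done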

    2019-08-03

    • Look at Bioversity's latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name…

    2019-08-04

    • Deploy ORCID identifier updates requested by Bioversity to CGSpace
    • Run system updates on CGSpace (linode18) and reboot it
      • Before updating it I checked Solr and verified that all statistics cores were loaded properly…
      • After rebooting, all statistics cores were loaded… wow, that's lucky.
    • Run system updates on DSpace Test (linode19) and reboot it
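
    • The kind of check meant by "verified that all statistics cores were loaded": ask Solr's cores admin API for its status (a sketch; the port and the jq filter are assumptions about this particular setup):

    # list the cores Solr currently has loaded
    curl -s 'http://localhost:8081/solr/admin/cores?action=STATUS&wt=json' | jq -r '.status | keys[]'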

    2019-07-01

    • Create an “AfricaRice books and book chapters” collection on CGSpace for AfricaRice
    • Last month Sisay asked why the following “most popular” statistics link for a range of months in 2018 works for the CIAT community on DSpace Test, but not on CGSpace
    • Abenet had another similar issue a few days ago when trying to find the stats for 2018 in the RTB community

    2019-06-02

    2019-06-03

    • Skype with Marie-Angélique and Abenet about CG Core v2

    2019-05-01

    • Help CCAFS with regenerating some item thumbnails after they uploaded new PDFs to some items on CGSpace
    • A user on the dspace-tech mailing list offered some suggestions for troubleshooting the problem with the inability to delete certain items
      • Apparently if the item is in the workflowitem table it is submitted to a workflow
      • And if it is in the workspaceitem table it is in the pre-submitted state
    • The item seems to be in a pre-submitted state, so I tried to delete it from there:

    dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
    DELETE 1

    • But after this I tried to delete the item from the XMLUI and it is still present…
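
    • A sketch of the check I'd do first next time, to confirm which table the stuck item is actually in (same item_id as above):

    # is the item in the pre-submission table or the workflow table?
    psql -d dspace -c 'SELECT * FROM workspaceitem WHERE item_id=74648;'
    psql -d dspace -c 'SELECT * FROM workflowitem WHERE item_id=74648;'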

    2019-04-01

    • Meeting with AgroKnow to discuss CGSpace, ILRI data, AReS, GARDIAN, etc
      • They asked if we had plans to enable RDF support in CGSpace
    • There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today
      • I suspected that some might not be successful, because the stats show less, but today they were all HTTP 200!

    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
       4432 200

    • In the last two weeks there have been 47,000 downloads of this same exact PDF by these three IP addresses
    • Apply country and region corrections and deletions on DSpace Test and CGSpace:

    $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
    $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
    $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
    $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
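
    • A sketch for tallying those Spore downloads per day rather than per status code, using the same logs and IPs (field positions assume nginx's default combined log format):

    # Spore PDF downloads per day from the three Amazon IPs
    cat /var/log/nginx/access.log /var/log/nginx/access.log.1 \
      | grep 'Spore-192-EN-web.pdf' \
      | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' \
      | awk -F'[' '{print substr($2, 1, 11)}' | sort | uniq -c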

    2019-03-01

    • I checked IITA's 259 Feb 14 records from last month for duplicates using Atmire's Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good
    • I am now only waiting to hear from her about where the items should go, though I assume Journal Articles go to IITA Journal Articles collection, etc…
    • Looking at the other half of Udana's WLE records from 2018-11
      • I finished the ones for Restoring Degraded Landscapes (RDL), but these are for Variability, Risks and Competing Uses (VRC)
      • I did the usual cleanups for whitespace, added regions where they made sense for certain countries, cleaned up the DOI link formats, added rights information based on the publications page for a few items
      • Most worryingly, there are encoding errors in the abstracts for eleven items, for example:
        • 68.15% � 9.45 instead of 68.15% ± 9.45
        • 2003�2013 instead of 2003–2013
    • I think I will need to ask Udana to re-copy and paste the abstracts with more care using Google Docs
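
    • A sketch for spotting the affected records, assuming the items are in a CSV export (the file name is hypothetical); it just greps for the U+FFFD replacement character:

    # list CSV lines whose text contains the Unicode replacement character
    grep -n $'\xef\xbf\xbd' /tmp/wle-vrc-records.csv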
    diff --git a/docs/tags/migration/index.html b/docs/tags/migration/index.html

    Possible changes to CGSpace metadata fields to align more with DC, QDC, and DCTERMS as well as CG Core v2.

    With reference to CG Core v2 draft standard by Marie-Angélique as well as DCMI DCTERMS.

    diff --git a/docs/tags/migration/index.xml b/docs/tags/migration/index.xml
    diff --git a/docs/tags/notes/index.html b/docs/tags/notes/index.html

    2017-09-06

    • Linode sent an alert that CGSpace (linode18) was using 261% CPU for the past two hours

    2017-09-07

    • Ask Sisay to clean up the WLE approvers a bit, as Marianne's user account is both in the approvers step as well as the group

    2017-08-01

    • Linode sent an alert that CGSpace (linode18) was using 350% CPU for the past two hours
    • I looked in the Activity pane of the Admin Control Panel and it seems that Google, Baidu, Yahoo, and Bing are all crawling with massive numbers of bots concurrently (~100 total, mostly Baidu and Google)
    • The good thing is that, according to dspace.log.2017-08-01, they are all using the same Tomcat session
    • This means our Tomcat Crawler Session Valve is working
    • But many of the bots are browsing dynamic URLs like:
      • /handle/10568/3353/discover
      • /handle/10568/16510/browse
    • The robots.txt only blocks the top-level /discover and /browse URLs… we will need to find a way to forbid them from accessing these!
    • Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
    • It turns out that we're already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
    • Also, the bot has to successfully browse the page first so it can receive the HTTP header…
    • We might actually have to block these requests with HTTP 403 depending on the user agent
    • Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415
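
    • A quick way to confirm the X-Robots-Tag header is actually being sent on those dynamic URLs (a sketch using one of the handles above):

    # check the response headers on a /discover URL
    curl -s -I 'https://cgspace.cgiar.org/handle/10568/3353/discover' | grep -i x-robots-tag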

    2017-07-01

    • Run system updates and reboot DSpace Test

    2017-07-04

    • Merge changes for WLE Phase II theme rename (#329)
    • Looking at extracting the metadata registries from ICARDA's MEL DSpace database so we can compare fields with CGSpace
    • We can use PostgreSQL's extended output format (-x) plus sed to format the output into quasi XML:
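
    • An illustrative sketch of the idea (the exact query and sed expression aren't shown above): dump a registry table with psql -x and roughly tag each column/value pair:

    # expanded output, wrapped as <column>value</column>
    psql -d dspace -x -c 'SELECT element, qualifier, scope_note FROM metadatafieldregistry;' \
      | sed -E 's#^([a-z_]+) *\| (.*)$#<\1>\2</\1>#'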

    2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we'll create a new sub-community for Phase II and create collections for the research themes there The current “Research Themes” community will be renamed to “WLE Phase I Research Themes” Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg.

    2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it's a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire's CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items that are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace.

    2017-04-02

    • Merge one change to CCAFS flagships that I had forgotten to remove last month (“MANAGING CLIMATE RISK”): https://github.com/ilri/DSpace/pull/317
    • Quick proof-of-concept hack to add dc.rights to the input form, including some inline instructions/hints:

    dc.rights in the submission form

    • Remove redundant/duplicate text in the DSpace submission license
    • Testing the CMYK patch on a collection with 650 items:

    $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt

    2017-03-01

    • Run the 279 CIAT author corrections on CGSpace

    2017-03-02

    • Skype with Michael and Peter, discussing moving the CGIAR Library to CGSpace
    • CGIAR people possibly open to moving content, redirecting library.cgiar.org to CGSpace and letting CGSpace resolve their handles
    • Also, need to consider talking to Atmire about hiring them to bring ORCiD metadata to REST / OAI
    • Filed an issue on DSpace issue tracker for the filter-media bug that causes it to process JPGs even when limiting to the PDF thumbnail plugin: DS-3516
    • Discovered that the ImageMagick filter-media plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK
    • Interestingly, it seems DSpace 4.x's thumbnails were sRGB, but forcing regeneration using DSpace 5.x's ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 10568/51999):

    $ identify ~/Desktop/alc_contrastes_desafios.jpg
    /Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
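
    • A sketch for finding which thumbnails ended up CMYK, assuming a directory of JPGs to check (the path is hypothetical):

    # print the colorspace of each thumbnail and flag the CMYK ones
    for f in /tmp/thumbnails/*.jpg; do
        if identify -format '%[colorspace]' "$f" | grep -q CMYK; then
            echo "$f is CMYK"
        fi
    done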

    2017-02-07

    • An item was mapped twice erroneously again, so I had to remove one of the mappings manually:

    dspace=# select * from collection2item where item_id = '80278';
      id   | collection_id | item_id
    -------+---------------+---------
     92551 |           313 |   80278
     92550 |           313 |   80278
     90774 |          1051 |   80278
    (3 rows)
    dspace=# delete from collection2item where id = 92551 and item_id = 80278;
    DELETE 1

    • Create issue on GitHub to track the addition of CCAFS Phase II project tags (#301)
    • Looks like we'll be using cg.identifier.ccafsprojectpii as the field name
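
    • A sketch of a query to see whether any other items are mapped twice to the same collection, based on the collection2item columns shown above:

    # find duplicate collection mappings
    psql -d dspace -c 'SELECT item_id, collection_id, COUNT(*) FROM collection2item GROUP BY item_id, collection_id HAVING COUNT(*) > 1;'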

    2017-01-02

    • I checked to see if the Solr sharding task that is supposed to run on January 1st had run and saw there was an error
    • I tested on DSpace Test as well and it doesn't work there either
    • I asked on the dspace-tech mailing list because it seems to be broken, and actually now I'm not sure if we've ever had the sharding task run successfully over all these years

    2016-12-02

    • CGSpace was down for five hours in the morning while I was sleeping
    • While looking in the logs for errors, I see tons of warnings about Atmire MQM:

    2016-12-02 03:00:32,352 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
    2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail="dc.title", transactionID="TX157907838689377964651674089851855413607")
    2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail="THUMBNAIL", transactionID="TX157907838689377964651674089851855413607")
    2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
    2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")

    • I see thousands of them in the logs for the last few months, so it's not related to the DSpace 5.5 upgrade
    • I've raised a ticket with Atmire to ask
    • Another worrying error from dspace.log is:
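
    • A sketch of how I'd count those BatchEditConsumer warnings per day to see when they started (the log path is assumed by analogy with the DSpace Test path used elsewhere in these notes, and older logs may be compressed):

    # count the MQM warnings in each recent daily log
    for log in /home/cgspace.cgiar.org/log/dspace.log.2016-1*; do
        echo -n "$log: "
        grep -c 'BatchEditConsumer should not have been given this kind of Subject' "$log"
    done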
    diff --git a/docs/tags/notes/index.xml b/docs/tags/notes/index.xml

    2016-06-01

    • Experimenting with IFPRI OAI (we want to harvest their publications)
    • After reading the ContentDM documentation I found IFPRI's OAI endpoint: http://ebrary.ifpri.org/oai/oai.php
    • After reading the OAI documentation and testing with an OAI validator I found out how to get their publications
    • This is their publications set: http://ebrary.ifpri.org/oai/oai.php?verb=ListRecords&from=2016-01-01&set=p15738coll2&metadataPrefix=oai_dc
    • You can see the others by using the OAI ListSets verb: http://ebrary.ifpri.org/oai/oai.php?verb=ListSets

    2016-05-01

    • Since yesterday there have been 10,000 REST errors and the site has been unstable again
    • I have blocked access to the API now
    • There are 3,000 IPs accessing the REST API in a 24-hour period!

    # awk '{print $1}' /var/log/nginx/rest.log | uniq | wc -l
    3168

    diff --git a/docs/tags/notes/page/2/index.html b/docs/tags/notes/page/2/index.html

    2016-11-01

    • Add dc.type to the output options for Atmire's Listings and Reports module (#286)

    Listings and Reports with output type

    2016-10-03

    • Testing adding ORCIDs to a CSV file for a single item to see if the author orders get messed up
    • Need to test the following scenarios to see how author order is affected:
      • ORCIDs only
      • ORCIDs plus normal authors
    • I exported a random item's metadata as CSV, deleted all columns except id and collection, and made a new column called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:

    0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
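
    • For reference, the resulting test CSV would look something like this (the id and collection handle are made up; the column header and the double-pipe separator are the ones described above):

    # write a minimal CSV for the batch metadata edit test
    printf '%s\n' \
      'id,collection,ORCID:dc.contributor.author' \
      '12345,10568/27629,0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X' \
      > /tmp/orcid-test.csv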

    2016-09-01

    • Discuss helping CCAFS with some batch tagging of ORCID IDs for their authors
    • Discuss how the migration of CGIAR's Active Directory to a flat structure will break our LDAP groups in DSpace
    • We had been using DC=ILRI to determine whether a user was ILRI or not
    • It looks like we might be able to use OUs now, instead of DCs:

    $ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=org" -D "admigration1@cgiarad.org" -W "(sAMAccountName=admigration1)"

    -

    2016-08-01

    - +

    2016-08-01

    • Add updated distribution license from Sisay (#259)
    • Play with upgrading Mirage 2 dependencies in bower.json because most are several versions out of date
    • Bootstrap is at 3.3.0 but upstream is at 3.3.7, and upgrading to anything beyond 3.3.1 breaks glyphicons and probably more
    • bower stuff is a dead end, waste of time, too many issues
    • Anything after Bootstrap 3.3.1 makes glyphicons disappear (HTTP 404 trying to access from incorrect path of fonts)
    • - -
    • Start working on DSpace 5.1 → 5.5 port:

      - +
    • Start working on DSpace 5.1 → 5.5 port:
    • +
    $ git checkout -b 55new 5_x-prod
     $ git reset --hard ilri/5_x-prod
     $ git rebase -i dspace-5.5
    -
    - + Read more → @@ -204,22 +192,19 @@ $ git rebase -i dspace-5.5

    -

    2016-07-01

    - +

    2016-07-01

    • Add dc.description.sponsorship to Discovery sidebar facets and make investors clickable in item view (#232)
    • - -
    • I think this query should find and replace all authors that have “,” at the end of their names:

      - +
    • I think this query should find and replace all authors that have “,” at the end of their names:
    • +
    dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
     UPDATE 95
     dspacetest=# select text_value from  metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
    -text_value
    + text_value
     ------------
     (0 rows)
    -
    - -
  • In this case the select query was showing 95 results before the update

  • +
      +
    • In this case the select query was showing 95 results before the update
    Read more → @@ -238,11 +223,10 @@ text_value

    -

    2016-06-01

    - +

    2016-06-01

    + Read more → @@ -294,13 +275,12 @@ text_value

    -

    2016-04-04

    - +

    2016-04-04

    • Looking at log file use on CGSpace and notice that we need to work on our cron setup a bit
    • We are backing up all logs in the log folder, including useless stuff like solr, cocoon, handle-plugin, etc
    • -
    • After running DSpace for over five years I’ve never needed to look in any other log file than dspace.log, let alone one from last year!
    • -
    • This will save us a few gigs of backup space we’re paying for on S3
    • +
    • After running DSpace for over five years I've never needed to look in any other log file than dspace.log, let alone one from last year!
    • +
    • This will save us a few gigs of backup space we're paying for on S3
    • Also, I noticed the checker log has some errors we should pay attention to:
    Read more → @@ -320,11 +300,10 @@ text_value

    -

    2016-03-02

    - +

    2016-03-02

    • Looking at issues with author authorities on CGSpace
    • -
    • For some reason we still have the index-lucene-update cron job active on CGSpace, but I’m pretty sure we don’t need it as of the latest few versions of Atmire’s Listings and Reports module
    • +
    • For some reason we still have the index-lucene-update cron job active on CGSpace, but I'm pretty sure we don't need it as of the latest few versions of Atmire's Listings and Reports module
    • Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match environment on CGSpace server
    Read more → @@ -344,16 +323,13 @@ text_value

    -

    2016-02-05

    - +

    2016-02-05

    • Looking at some DAGRIS data for Abenet Yabowork
    • Lots of issues with spaces, newlines, etc causing the import to fail
    • I noticed we have a very interesting list of countries on CGSpace:
    - -

    CGSpace country list

    - +

    CGSpace country list

    • Not only are there 49,000 countries, we have some blanks (25)…
    • Also, lots of things like “COTE D`LVOIRE” and “COTE D IVOIRE”
    • diff --git a/docs/tags/notes/page/3/index.html b/docs/tags/notes/page/3/index.html index a22cc7fad..7030000cb 100644 --- a/docs/tags/notes/page/3/index.html +++ b/docs/tags/notes/page/3/index.html @@ -9,13 +9,12 @@ - - + @@ -85,8 +84,7 @@

      -

      2016-01-13

      - +

      2016-01-13

      • Move ILRI collection 10568/12503 from 10568/27869 to 10568/27629 using the move_collections.sh script I wrote last year.
      • I realized it is only necessary to clear the Cocoon cache after moving collections—rather than reindexing—as no metadata has changed, and therefore no search or browse indexes need to be updated.
      • @@ -109,18 +107,16 @@

        -

        2015-12-02

        - +

        2015-12-02

          -
        • Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less space:

          - +
        • Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less space:
        • +
        # cd /home/dspacetest.cgiar.org/log
         # ls -lh dspace.log.2015-11-18*
         -rw-rw-r-- 1 tomcat7 tomcat7 2.0M Nov 18 23:59 dspace.log.2015-11-18
         -rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo
         -rw-rw-r-- 1 tomcat7 tomcat7 169K Nov 18 23:59 dspace.log.2015-11-18.xz
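    For reference, only the compression command in that cron job needs to change; a hypothetical crontab entry (path and schedule are assumptions, and % has to be escaped inside a crontab):
    0 0 * * * xz /home/dspacetest.cgiar.org/log/dspace.log.$(date --date "yesterday" +\%Y-\%m-\%d)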
        -
        -
      + Read more → @@ -138,18 +134,15 @@

      -

      2015-11-22

      - +

      2015-11-22

      • CGSpace went down
      • Looks like DSpace exhausted its PostgreSQL connection pool
      • - -
      • Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:

        - +
      • Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:
      • +
      $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
       78
      -
      -
    + Read more → diff --git a/docs/tags/page/2/index.html b/docs/tags/page/2/index.html index 842c0b17f..fd5330634 100644 --- a/docs/tags/page/2/index.html +++ b/docs/tags/page/2/index.html @@ -9,13 +9,12 @@ - - + @@ -100,40 +99,34 @@

    -

    2019-02-01

    - +

    2019-02-01

    • Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!
    • - -
    • The top IPs before, during, and after this latest alert tonight were:

      - +
    • The top IPs before, during, and after this latest alert tonight were:
    • +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "01/Feb/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    -245 207.46.13.5
    -332 54.70.40.11
    -385 5.143.231.38
    -405 207.46.13.173
    -405 207.46.13.75
    -1117 66.249.66.219
    -1121 35.237.175.180
    -1546 5.9.6.51
    -2474 45.5.186.2
    -5490 85.25.237.71
    -
    - -
  • 85.25.237.71 is the “Linguee Bot” that I first saw last month

  • - -
  • The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase

  • - -
  • There were just over 3 million accesses in the nginx logs last month:

    +    245 207.46.13.5
    +    332 54.70.40.11
    +    385 5.143.231.38
    +    405 207.46.13.173
    +    405 207.46.13.75
    +   1117 66.249.66.219
    +   1121 35.237.175.180
    +   1546 5.9.6.51
    +   2474 45.5.186.2
    +   5490 85.25.237.71
      +
    • 85.25.237.71 is the “Linguee Bot” that I first saw last month
    • +
    • The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase
    • +
    • There were just over 3 million accesses in the nginx logs last month:
    • +
    # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
     3018243
     
     real    0m19.873s
     user    0m22.203s
     sys     0m1.979s
    -
  • - + Read more → @@ -151,26 +144,23 @@ sys 0m1.979s

    -

    2019-01-02

    - +

    2019-01-02

    • Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning
    • - -
    • I don’t see anything interesting in the web server logs around that time though:

      - -
      # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      - 92 40.77.167.4
      - 99 210.7.29.100
      -120 38.126.157.45
      -177 35.237.175.180
      -177 40.77.167.32
      -216 66.249.75.219
      -225 18.203.76.93
      -261 46.101.86.248
      -357 207.46.13.1
      -903 54.70.40.11
      -
    • +
    • I don't see anything interesting in the web server logs around that time though:
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +     92 40.77.167.4
    +     99 210.7.29.100
    +    120 38.126.157.45
    +    177 35.237.175.180
    +    177 40.77.167.32
    +    216 66.249.75.219
    +    225 18.203.76.93
    +    261 46.101.86.248
    +    357 207.46.13.1
    +    903 54.70.40.11
    +
    Read more → @@ -188,16 +178,13 @@ sys 0m1.979s

    -

    2018-12-01

    - +

    2018-12-01

    • Switch CGSpace (linode18) to use OpenJDK instead of Oracle JDK
    • I manually installed OpenJDK, then removed Oracle JDK, then re-ran the Ansible playbook to update all configuration files, etc
    • Then I ran all system updates and restarted the server
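    A rough sketch of the manual OpenJDK swap mentioned above (the package names and the playbook invocation are assumptions, not the exact commands used):
    $ sudo apt-get install openjdk-8-jdk-headless
    $ sudo apt-get remove --purge oracle-java8-installer
    $ ansible-playbook cgspace.yml --limit linode18   # playbook name and limit are hypothetical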
    - -

    2018-12-02

    - +

    2018-12-02

    @@ -218,15 +205,12 @@ sys 0m1.979s

    -

    2018-11-01

    - +

    2018-11-01

    • Finalize AReS Phase I and Phase II ToRs
    • Send a note about my dspace-statistics-api to the dspace-tech mailing list
    - -

    2018-11-03

    - +

    2018-11-03

    • Linode has been sending mails a few times a day recently that CGSpace (linode18) has had high CPU usage
    • Today these are the top 10 IPs:
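    The list itself is not shown in this excerpt, but elsewhere in these notes it is produced with a one-liner along these lines (the date pattern for that day is an assumption):
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Nov/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10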
    • @@ -248,11 +232,10 @@ sys 0m1.979s

      -

      2018-10-01

      - +

      2018-10-01

      • Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items
      • -
      • I created a GitHub issue to track this #389, because I’m super busy in Nairobi right now
      • +
      • I created a GitHub issue to track this #389, because I'm super busy in Nairobi right now
      Read more → @@ -271,13 +254,12 @@ sys 0m1.979s

      -

      2018-09-02

      - +

      2018-09-02

      • New PostgreSQL JDBC driver version 42.2.5
      • -
      • I’ll update the DSpace role in our Ansible infrastructure playbooks and run the updated playbooks on CGSpace and DSpace Test
      • -
      • Also, I’ll re-run the postgresql tasks because the custom PostgreSQL variables are dynamic according to the system’s RAM, and we never re-ran them after migrating to larger Linodes last month
      • -
      • I’m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I’m getting those autowire errors in Tomcat 8.5.30 again:
      • +
      • I'll update the DSpace role in our Ansible infrastructure playbooks and run the updated playbooks on CGSpace and DSpace Test
      • +
      • Also, I'll re-run the postgresql tasks because the custom PostgreSQL variables are dynamic according to the system's RAM, and we never re-ran them after migrating to larger Linodes last month
      • +
      • I'm testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I'm getting those autowire errors in Tomcat 8.5.30 again:
      Read more → @@ -296,27 +278,20 @@ sys 0m1.979s

      -

      2018-08-01

      - +

      2018-08-01

        -
      • DSpace Test had crashed at some point yesterday morning and I see the following in dmesg:

        - +
      • DSpace Test had crashed at some point yesterday morning and I see the following in dmesg:
      • +
      [Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
       [Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB
       [Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      -
      - -
    • Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight

    • - -
    • From the DSpace log I see that eventually Solr stopped responding, so I guess the java process that was OOM killed above was Tomcat’s

    • - -
    • I’m not sure why Tomcat didn’t crash with an OutOfMemoryError…

    • - -
    • Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core

    • - -
    • The server only has 8GB of RAM so we’ll eventually need to upgrade to a larger one because we’ll start starving the OS, PostgreSQL, and command line batch processes

    • - -
    • I ran all system updates on DSpace Test and rebooted it

    • +
        +
      • Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight
      • +
      • From the DSpace log I see that eventually Solr stopped responding, so I guess the java process that was OOM killed above was Tomcat's
      • +
      • I'm not sure why Tomcat didn't crash with an OutOfMemoryError…
      • +
      • Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core
      • +
      • The server only has 8GB of RAM so we'll eventually need to upgrade to a larger one because we'll start starving the OS, PostgreSQL, and command line batch processes
      • +
      • I ran all system updates on DSpace Test and rebooted it
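    If the heap does get bumped, it is just the -Xms/-Xmx values in Tomcat's JAVA_OPTS; a sketch assuming the stock Ubuntu defaults file:
    # in /etc/default/tomcat7, raise the heap from 5120m to 6144m
    JAVA_OPTS="-Djava.awt.headless=true -Xms6144m -Xmx6144m"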
      Read more → @@ -335,19 +310,16 @@ sys 0m1.979s

      -

      2018-07-01

      - +

      2018-07-01

        -
      • I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:

        - -
        $ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
        -
      • - -
      • During the mvn package stage on the 5.8 branch I kept getting issues with java running out of memory:

        - -
        There is insufficient memory for the Java Runtime Environment to continue.
        -
      • +
      • I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:
      +
      $ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
      +
        +
      • During the mvn package stage on the 5.8 branch I kept getting issues with java running out of memory:
      • +
      +
      There is insufficient memory for the Java Runtime Environment to continue.
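    One possible workaround for that build-time memory error (the heap size here is a guess, not the value actually used) is to give Maven a larger heap before the package step:
    $ export MAVEN_OPTS="-Xmx2048m"
    $ mvn -U clean package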
      +
      Read more → @@ -365,32 +337,27 @@ sys 0m1.979s

      -

      2018-06-04

      - +

      2018-06-04

      • Test the DSpace 5.8 module upgrades from Atmire (#378) -
          -
        • There seems to be a problem with the CUA and L&R versions in pom.xml because they are using SNAPSHOT and it doesn’t build
        • -
      • +
      • There seems to be a problem with the CUA and L&R versions in pom.xml because they are using SNAPSHOT and it doesn't build
      • +
      +
    • I added the new CCAFS Phase II Project Tag PII-FP1_PACCA2 and merged it into the 5_x-prod branch (#379)
    • - -
    • I proofed and tested the ILRI author corrections that Peter sent back to me this week:

      - +
    • I proofed and tested the ILRI author corrections that Peter sent back to me this week:
    • +
    $ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
    -
    - -
  • I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in March, 2018

  • - -
  • Time to index ~70,000 items on CGSpace:

    - +
      +
    • I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in March, 2018
    • +
    • Time to index ~70,000 items on CGSpace:
    • +
    $ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b                                  
     
     real    74m42.646s
     user    8m5.056s
     sys     2m7.289s
    -
  • - + Read more → @@ -408,15 +375,14 @@ sys 2m7.289s

    -

    2018-05-01

    - +

    2018-05-01

    +
  • Then I reduced the JVM heap size from 6144 back to 5120m
  • Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked the Ansible infrastructure scripts to support hosts choosing which distribution they want to use
  • diff --git a/docs/tags/page/3/index.html b/docs/tags/page/3/index.html index 52fdd2920..201cf7aab 100644 --- a/docs/tags/page/3/index.html +++ b/docs/tags/page/3/index.html @@ -9,13 +9,12 @@ - - + @@ -100,10 +99,9 @@

    -

    2018-04-01

    - +

    2018-04-01

      -
    • I tried to test something on DSpace Test but noticed that it’s down since god knows when
    • +
    • I tried to test something on DSpace Test but noticed that it's down since god knows when
    • Catalina logs at least show some memory errors yesterday:
    Read more → @@ -123,8 +121,7 @@

    -

    2018-03-02

    - +

    2018-03-02

    • Export a CSV of the IITA community metadata for Martin Mueller
    @@ -145,13 +142,12 @@

    -

    2018-02-01

    - +

    2018-02-01

    • Peter gave feedback on the dc.rights proof of concept that I had sent him last week
    • -
    • We don’t need to distinguish between internal and external works, so that makes it just a simple list
    • +
    • We don't need to distinguish between internal and external works, so that makes it just a simple list
    • Yesterday I figured out how to monitor DSpace sessions using JMX
    • -
    • I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu’s munin-plugins-java package and used the stuff I discovered about JMX in 2018-01
    • +
    • I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu's munin-plugins-java package and used the stuff I discovered about JMX in 2018-01
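    Exposing a JMX port that such a Munin plugin can read only takes a few JVM flags; a sketch with an assumed port, unauthenticated and intended for local-only scraping:
    JAVA_OPTS="$JAVA_OPTS -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=5400 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"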
    Read more → @@ -170,33 +166,26 @@

    -

    2018-01-02

    - +

    2018-01-02

    • Uptime Robot noticed that CGSpace went down and up a few times last night, for a few minutes each time
    • -
    • I didn’t get any load alerts from Linode and the REST and XMLUI logs don’t show anything out of the ordinary
    • +
    • I didn't get any load alerts from Linode and the REST and XMLUI logs don't show anything out of the ordinary
    • The nginx logs show HTTP 200s until 02/Jan/2018:11:27:17 +0000 when Uptime Robot got an HTTP 500
    • In dspace.log around that time I see many errors like “Client closed the connection before file download was complete”
    • - -
    • And just before that I see this:

      - +
    • And just before that I see this:
    • +
    Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
    -
    - -
  • Ah hah! So the pool was actually empty!

  • - -
  • I need to increase that, let’s try to bump it up from 50 to 75

  • - -
  • After that one client got an HTTP 499 but then the rest were HTTP 200, so I don’t know what the hell Uptime Robot saw

  • - -
  • I notice this error quite a few times in dspace.log:

    - +
      +
    • Ah hah! So the pool was actually empty!
    • +
    • I need to increase that, let's try to bump it up from 50 to 75
    • +
    • After that one client got an HTTP 499 but then the rest were HTTP 200, so I don't know what the hell Uptime Robot saw
    • +
    • I notice this error quite a few times in dspace.log:
    • +
    2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
     org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered " "]" "] "" at line 1, column 32.
    -
  • - -
  • And there are many of these errors every day for the past month:

    - +
      +
    • And there are many of these errors every day for the past month:
    • +
    $ grep -c "Error while searching for sidebar facets" dspace.log.*
     dspace.log.2017-11-21:4
     dspace.log.2017-11-22:1
    @@ -241,9 +230,8 @@ dspace.log.2017-12-30:89
     dspace.log.2017-12-31:53
     dspace.log.2018-01-01:45
     dspace.log.2018-01-02:34
    -
  • - -
  • Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let’s Encrypt if it’s just a handful of domains

  • +
      +
    • Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let's Encrypt if it's just a handful of domains
    Read more → @@ -262,8 +250,7 @@ dspace.log.2018-01-02:34

    -

    2017-12-01

    - +

    2017-12-01

    • Uptime Robot noticed that CGSpace went down
    • The logs say “Timeout waiting for idle object”
    • @@ -287,27 +274,22 @@ dspace.log.2018-01-02:34

      -

      2017-11-01

      - +

      2017-11-01

      • The CORE developers responded to say they are looking into their bot not respecting our robots.txt
      - -

      2017-11-02

      - +

      2017-11-02

        -
      • Today there have been no hits by CORE and no alerts from Linode (coincidence?)

        - +
      • Today there have been no hits by CORE and no alerts from Linode (coincidence?)
      • +
      # grep -c "CORE" /var/log/nginx/access.log
       0
      -
      - -
    • Generate list of authors on CGSpace for Peter to go through and correct:

      - +
        +
      • Generate list of authors on CGSpace for Peter to go through and correct:
      • +
      dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
       COPY 54701
      -
    • -
    + Read more → @@ -325,17 +307,14 @@ COPY 54701

    -

    2017-10-01

    - +

    2017-10-01

    http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
    -
    - -
  • There appears to be a pattern but I’ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine

  • - -
  • Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections

  • +
      +
    • There appears to be a pattern but I'll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
    • +
    • Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
    Read more → @@ -374,16 +353,13 @@ COPY 54701

    -

    2017-09-06

    - +

    2017-09-06

    • Linode sent an alert that CGSpace (linode18) was using 261% CPU for the past two hours
    - -

    2017-09-07

    - +

    2017-09-07

      -
    • Ask Sisay to clean up the WLE approvers a bit, as Marianne’s user account is both in the approvers step as well as the group
    • +
    • Ask Sisay to clean up the WLE approvers a bit, as Marianne's user account is both in the approvers step as well as the group
    Read more → @@ -402,22 +378,21 @@ COPY 54701

    -

    2017-08-01

    - +

    2017-08-01

    • Linode sent an alert that CGSpace (linode18) was using 350% CPU for the past two hours
    • I looked in the Activity pane of the Admin Control Panel and it seems that Google, Baidu, Yahoo, and Bing are all crawling with massive numbers of bots concurrently (~100 total, mostly Baidu and Google)
    • The good thing is that, according to dspace.log.2017-08-01, they are all using the same Tomcat session
    • This means our Tomcat Crawler Session Valve is working
    • But many of the bots are browsing dynamic URLs like: -
      • /handle/10568/3353/discover
      • /handle/10568/16510/browse
      • -
    • +
    +
  • The robots.txt only blocks the top-level /discover and /browse URLs… we will need to find a way to forbid them from accessing these!
  • Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
  • -
  • It turns out that we’re already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
  • +
  • It turns out that we're already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
  • Also, the bot has to successfully browse the page first so it can receive the HTTP header…
  • We might actually have to block these requests with HTTP 403 depending on the user agent
  • Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415
  • diff --git a/docs/tags/page/4/index.html b/docs/tags/page/4/index.html index 4b60c6cad..4f6f633a2 100644 --- a/docs/tags/page/4/index.html +++ b/docs/tags/page/4/index.html @@ -9,13 +9,12 @@ - - + @@ -100,18 +99,15 @@

    -

    2017-07-01

    - +

    2017-07-01

    • Run system updates and reboot DSpace Test
    - -

    2017-07-04

    - +

    2017-07-04

    • Merge changes for WLE Phase II theme rename (#329)
    • -
    • Looking at extracting the metadata registries from ICARDA’s MEL DSpace database so we can compare fields with CGSpace
    • -
    • We can use PostgreSQL’s extended output format (-x) plus sed to format the output into quasi XML:
    • +
    • Looking at extracting the metadata registries from ICARDA's MEL DSpace database so we can compare fields with CGSpace
    • +
    • We can use PostgreSQL's extended output format (-x) plus sed to format the output into quasi XML:
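    A minimal sketch of that approach (connection details and the sed expressions are assumptions; the exact one-liner is not shown in this excerpt):
    $ psql -x -U dspace -c 'SELECT element, qualifier, scope_note FROM metadatafieldregistry' dspace | sed -r 's/^-\[ RECORD [0-9]+ \].*/<field>/; s/^([a-z_]+) *\| (.*)/  <\1>\2<\/\1>/'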
    Read more → @@ -130,7 +126,7 @@

    - 2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we’ll create a new sub-community for Phase II and create collections for the research themes there The current “Research Themes” community will be renamed to “WLE Phase I Research Themes” Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg. + 2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we'll create a new sub-community for Phase II and create collections for the research themes there The current “Research Themes” community will be renamed to “WLE Phase I Research Themes” Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg. Read more → @@ -148,7 +144,7 @@

    - 2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it’s a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire’s CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace. + 2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it's a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire's CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace. Read more → @@ -166,23 +162,18 @@

    -

    2017-04-02

    - +

    2017-04-02

    • Merge one change to CCAFS flagships that I had forgotten to remove last month (“MANAGING CLIMATE RISK”): https://github.com/ilri/DSpace/pull/317
    • Quick proof-of-concept hack to add dc.rights to the input form, including some inline instructions/hints:
    - -

    dc.rights in the submission form

    - +

    dc.rights in the submission form

    • Remove redundant/duplicate text in the DSpace submission license
    • - -
    • Testing the CMYK patch on a collection with 650 items:

      - -
      $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt
      -
    • +
    • Testing the CMYK patch on a collection with 650 items:
    +
    $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt
    +
    Read more → @@ -200,14 +191,11 @@

    -

    2017-03-01

    - +

    2017-03-01

    • Run the 279 CIAT author corrections on CGSpace
    - -

    2017-03-02

    - +

    2017-03-02

    • Skype with Michael and Peter, discussing moving the CGIAR Library to CGSpace
    • CGIAR people possibly open to moving content, redirecting library.cgiar.org to CGSpace and letting CGSpace resolve their handles
    • @@ -217,13 +205,11 @@
    • Also, need to consider talking to Atmire about hiring them to bring ORCiD metadata to REST / OAI
    • Filed an issue on DSpace issue tracker for the filter-media bug that causes it to process JPGs even when limiting to the PDF thumbnail plugin: DS-3516
    • Discovered that the ImageMagic filter-media plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK
    • - -
    • Interestingly, it seems DSpace 4.x’s thumbnails were sRGB, but forcing regeneration using DSpace 5.x’s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 10568/51999):

      - +
    • Interestingly, it seems DSpace 4.x's thumbnails were sRGB, but forcing regeneration using DSpace 5.x's ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 10568/51999):
    • +
    $ identify ~/Desktop/alc_contrastes_desafios.jpg
     /Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
    -
    - + Read more → @@ -241,25 +227,22 @@

    -

    2017-02-07

    - +

    2017-02-07

      -
    • An item was mapped twice erroneously again, so I had to remove one of the mappings manually:

      - +
    • An item was mapped twice erroneously again, so I had to remove one of the mappings manually:
    • +
    dspace=# select * from collection2item where item_id = '80278';
    -id   | collection_id | item_id
    +  id   | collection_id | item_id
     -------+---------------+---------
    -92551 |           313 |   80278
    -92550 |           313 |   80278
    -90774 |          1051 |   80278
    + 92551 |           313 |   80278
    + 92550 |           313 |   80278
    + 90774 |          1051 |   80278
     (3 rows)
     dspace=# delete from collection2item where id = 92551 and item_id = 80278;
     DELETE 1
    -
    - -
  • Create issue on GitHub to track the addition of CCAFS Phase II project tags (#301)

  • - -
  • Looks like we’ll be using cg.identifier.ccafsprojectpii as the field name

  • +
      +
    • Create issue on GitHub to track the addition of CCAFS Phase II project tags (#301)
    • +
    • Looks like we'll be using cg.identifier.ccafsprojectpii as the field name
    Read more → @@ -278,12 +261,11 @@ DELETE 1

    -

    2017-01-02

    - +

    2017-01-02

    • I checked to see if the Solr sharding task that is supposed to run on January 1st had run and saw there was an error
    • -
    • I tested on DSpace Test as well and it doesn’t work there either
    • -
    • I asked on the dspace-tech mailing list because it seems to be broken, and actually now I’m not sure if we’ve ever had the sharding task run successfully over all these years
    • +
    • I tested on DSpace Test as well and it doesn't work there either
    • +
    • I asked on the dspace-tech mailing list because it seems to be broken, and actually now I'm not sure if we've ever had the sharding task run successfully over all these years
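    For reference, the yearly statistics sharding can be triggered manually with DSpace's stats-util; a sketch, assuming -s is still the shard option in this version (check dspace stats-util -h):
    $ [dspace]/bin/dspace stats-util -s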
    Read more → @@ -302,25 +284,20 @@ DELETE 1

    -

    2016-12-02

    - +

    2016-12-02

    • CGSpace was down for five hours in the morning while I was sleeping
    • - -
    • While looking in the logs for errors, I see tons of warnings about Atmire MQM:

      - +
    • While looking in the logs for errors, I see tons of warnings about Atmire MQM:
    • +
    2016-12-02 03:00:32,352 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail="dc.title", transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail="THUMBNAIL", transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
    -
    - -
  • I see thousands of them in the logs for the last few months, so it’s not related to the DSpace 5.5 upgrade

  • - -
  • I’ve raised a ticket with Atmire to ask

  • - -
  • Another worrying error from dspace.log is:

  • +
      +
    • I see thousands of them in the logs for the last few months, so it's not related to the DSpace 5.5 upgrade
    • +
    • I've raised a ticket with Atmire to ask
    • +
    • Another worrying error from dspace.log is:
    Read more → @@ -339,13 +316,11 @@ DELETE 1

    -

    2016-11-01

    - +

    2016-11-01

      -
    • Add dc.type to the output options for Atmire’s Listings and Reports module (#286)
    • +
    • Add dc.type to the output options for Atmire's Listings and Reports module (#286)
    - -

    Listings and Reports with output type

    +

    Listings and Reports with output type

    Read more → @@ -363,22 +338,19 @@ DELETE 1

    -

    2016-10-03

    - +

    2016-10-03

    • Testing adding ORCIDs to a CSV file for a single item to see if the author orders get messed up
    • Need to test the following scenarios to see how author order is affected: -
      • ORCIDs only
      • ORCIDs plus normal authors
      • -
    • - -
    • I exported a random item’s metadata as CSV, deleted all columns except id and collection, and made a new column called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:

      - -
      0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
      -
    + +
  • I exported a random item's metadata as CSV, deleted all columns except id and collection, and made a new column called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:
  • + +
    0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
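    Applying such a test CSV would use DSpace's metadata-import tool; a sketch with a hypothetical file path and e-person:
    $ [dspace]/bin/dspace metadata-import -e admin@example.com -f /tmp/orcid-test.csv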
    +
    Read more → diff --git a/docs/tags/page/5/index.html b/docs/tags/page/5/index.html index e6858dbc1..64983c8e1 100644 --- a/docs/tags/page/5/index.html +++ b/docs/tags/page/5/index.html @@ -9,13 +9,12 @@ - - + @@ -100,18 +99,15 @@

    -

    2016-09-01

    - +

    2016-09-01

    • Discuss helping CCAFS with some batch tagging of ORCID IDs for their authors
    • -
    • Discuss how the migration of CGIAR’s Active Directory to a flat structure will break our LDAP groups in DSpace
    • +
    • Discuss how the migration of CGIAR's Active Directory to a flat structure will break our LDAP groups in DSpace
    • We had been using DC=ILRI to determine whether a user was ILRI or not
    • - -
    • It looks like we might be able to use OUs now, instead of DCs:

      - -
      $ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=org" -D "admigration1@cgiarad.org" -W "(sAMAccountName=admigration1)"
      -
    • +
    • It looks like we might be able to use OUs now, instead of DCs:
    +
    $ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=org" -D "admigration1@cgiarad.org" -W "(sAMAccountName=admigration1)"
    +
    Read more → @@ -129,22 +125,19 @@

    -

    2016-08-01

    - +

    2016-08-01

    • Add updated distribution license from Sisay (#259)
    • Play with upgrading Mirage 2 dependencies in bower.json because most are several versions out of date
    • Bootstrap is at 3.3.0 but upstream is at 3.3.7, and upgrading to anything beyond 3.3.1 breaks glyphicons and probably more
    • bower stuff is a dead end, waste of time, too many issues
    • Anything after Bootstrap 3.3.1 makes glyphicons disappear (HTTP 404 trying to access from incorrect path of fonts)
    • - -
    • Start working on DSpace 5.1 → 5.5 port:

      - +
    • Start working on DSpace 5.1 → 5.5 port:
    • +
    $ git checkout -b 55new 5_x-prod
     $ git reset --hard ilri/5_x-prod
     $ git rebase -i dspace-5.5
    -
    - + Read more → @@ -162,22 +155,19 @@ $ git rebase -i dspace-5.5

    -

    2016-07-01

    - +

    2016-07-01

    • Add dc.description.sponsorship to Discovery sidebar facets and make investors clickable in item view (#232)
    • - -
    • I think this query should find and replace all authors that have “,” at the end of their names:

      - +
    • I think this query should find and replace all authors that have “,” at the end of their names:
    • +
    dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
     UPDATE 95
     dspacetest=# select text_value from  metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
    -text_value
    + text_value
     ------------
     (0 rows)
    -
    - -
  • In this case the select query was showing 95 results before the update

  • +
      +
    • In this case the select query was showing 95 results before the update
    Read more → @@ -196,11 +186,10 @@ text_value

    -

    2016-06-01

    - +

    2016-06-01

    + Read more → @@ -252,13 +238,12 @@ text_value

    -

    2016-04-04

    - +

    2016-04-04

    • Looking at log file use on CGSpace and notice that we need to work on our cron setup a bit
    • We are backing up all logs in the log folder, including useless stuff like solr, cocoon, handle-plugin, etc
    • -
    • After running DSpace for over five years I’ve never needed to look in any other log file than dspace.log, let alone one from last year!
    • -
    • This will save us a few gigs of backup space we’re paying for on S3
    • +
    • After running DSpace for over five years I've never needed to look in any other log file than dspace.log, let alone one from last year!
    • +
    • This will save us a few gigs of backup space we're paying for on S3
    • Also, I noticed the checker log has some errors we should pay attention to:
    Read more → @@ -278,11 +263,10 @@ text_value

    -

    2016-03-02

    - +

    2016-03-02

    • Looking at issues with author authorities on CGSpace
    • -
    • For some reason we still have the index-lucene-update cron job active on CGSpace, but I’m pretty sure we don’t need it as of the latest few versions of Atmire’s Listings and Reports module
    • +
    • For some reason we still have the index-lucene-update cron job active on CGSpace, but I'm pretty sure we don't need it as of the latest few versions of Atmire's Listings and Reports module
    • Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match environment on CGSpace server
    Read more → @@ -302,16 +286,13 @@ text_value

    -

    2016-02-05

    - +

    2016-02-05

    • Looking at some DAGRIS data for Abenet Yabowork
    • Lots of issues with spaces, newlines, etc causing the import to fail
    • I noticed we have a very interesting list of countries on CGSpace:
    - -

    CGSpace country list

    - +

    CGSpace country list

    • Not only are there 49,000 countries, we have some blanks (25)…
    • Also, lots of things like “COTE D`LVOIRE” and “COTE D IVOIRE”
    • @@ -333,8 +314,7 @@ text_value

      -

      2016-01-13

      - +

      2016-01-13

      • Move ILRI collection 10568/12503 from 10568/27869 to 10568/27629 using the move_collections.sh script I wrote last year.
      • I realized it is only necessary to clear the Cocoon cache after moving collections—rather than reindexing—as no metadata has changed, and therefore no search or browse indexes need to be updated.
      • @@ -357,18 +337,16 @@ text_value

        -

        2015-12-02

        - +

        2015-12-02

          -
        • Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less space:

          - +
        • Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less space:
        • +
        # cd /home/dspacetest.cgiar.org/log
         # ls -lh dspace.log.2015-11-18*
         -rw-rw-r-- 1 tomcat7 tomcat7 2.0M Nov 18 23:59 dspace.log.2015-11-18
         -rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo
         -rw-rw-r-- 1 tomcat7 tomcat7 169K Nov 18 23:59 dspace.log.2015-11-18.xz
        -
        -
      + Read more → diff --git a/docs/tags/page/6/index.html b/docs/tags/page/6/index.html index 95d595722..39356fea1 100644 --- a/docs/tags/page/6/index.html +++ b/docs/tags/page/6/index.html @@ -9,13 +9,12 @@ - - + @@ -100,18 +99,15 @@

      -

      2015-11-22

      - +

      2015-11-22

      • CGSpace went down
      • Looks like DSpace exhausted its PostgreSQL connection pool
      • - -
      • Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:

        - +
      • Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:
      • +
      $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
       78
      -
      -
    + Read more →